Train model with Asynchronous mode #492
You needed to read a little further down the page to use Gurobi: |
Thanks, it worked perfectly. |
Hi, sorry for reopening this issue, but I have a question, and opening a new issue might be unnecessary.
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 3.664092e+07 7.371414e+05 4.740712e+02 1 2484
2 1.954982e+07 9.953921e+06 7.951662e+02 1 4680
3 4.591550e+07 9.953921e+06 8.032407e+02 2 4740
4 4.866310e+07 9.953921e+06 8.118478e+02 3 4800
5 4.878851e+07 9.953921e+06 8.195227e+02 4 4860
6 4.355507e+07 9.953921e+06 8.272206e+02 5 4920
7 1.969249e+07 1.026331e+07 1.138915e+03 1 7116
8 1.469963e+07 1.026350e+07 1.458276e+03 1 9312
9 1.564037e+07 1.200186e+07 1.808019e+03 1 11508
10 1.341304e+07 1.203708e+07 2.155772e+03 1 13704
11 1.816808e+07 1.205720e+07 2.490398e+03 1 15900
It seems that it uses the 5 workers just once and then trains with only one of them for the rest of the iterations. I've trained the model with 30 iterations, and only the first few iterations used workers other than the first one. The docs say "SDDP.jl will start in serial mode while the initialization takes place. Therefore, in the log you will see that the initial iterations take place on the master thread (Proc. ID = 1), and it is only after while that the solve switches to full parallelism", but I'm getting something different. Thanks in advance! |
Why does it take 474 seconds to do one iteration? What machine is this on? It looks like you're probably running out of RAM, so the other 4 workers are actually slower than just running in serial mode. |
I don't know why it takes 474 seconds to do one iteration. The problem is a large hydro-thermal scheduling model with 56 stages (representing about 9 years), more than 200 buses, and 150 nodes in the water network. |
The parallel scheme is not well optimized, and it requires a lot of data movement between the processors. I assume for this model the set-up and data movement overhead outweighs the benefit of running in parallel. How do the times look if you run in serial? |
This is the output in serial mode:
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 4.379315e+07 6.238694e+05 5.137349e+02 1 2196
2 2.398887e+07 1.024725e+07 9.844802e+02 1 4392
3 1.365226e+07 1.025685e+07 1.464048e+03 1 6588
4 2.005606e+07 1.188252e+07 1.904386e+03 1 8784
5 1.751424e+07 1.188255e+07 2.471620e+03 1 10980
6 1.362879e+07 1.259155e+07 2.880270e+03 1 13176
7 1.396052e+07 1.261973e+07 3.248855e+03 1 15372
8 1.366801e+07 1.261982e+07 3.658474e+03 1 17568
9 1.412258e+07 1.262390e+07 4.027489e+03 1 19764
10 1.476644e+07 1.262392e+07 4.426796e+03 1 21960
11 1.387798e+07 1.298193e+07 4.769414e+03 1 24156
12 1.344110e+07 1.299375e+07 5.248231e+03 1 26352
13 1.444331e+07 1.299466e+07 5.842923e+03 1 28548
14 1.375559e+07 1.336322e+07 6.328565e+03 1 30744
15 1.335609e+07 1.336351e+07 6.852967e+03 1 32940
16 1.357505e+07 1.340325e+07 7.329363e+03 1 35136
17 1.329446e+07 1.340336e+07 7.866351e+03 1 37332
18 1.822702e+07 1.340367e+07 8.441247e+03 1 39528
19 1.401540e+07 1.340401e+07 8.839535e+03 1 41724
20 1.349358e+07 1.340825e+07 9.145367e+03 1 43920
It converges in fewer iterations, as we would expect. Each iteration is a little slower than the iterations that ran on process 1 in the asynchronous run. |
There's something not quite right about your model. Does it really take 200ms per LP solve? What solver are you using? Is it an LP? Can you provide the full log from SDDP.jl (including all the stuff at the top)? |
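As a rough check on the 200 ms figure, the average time per solve can be estimated by dividing the cumulative time by the cumulative number of solves reported in the serial-mode log above. The sketch below uses only numbers copied from that log; the variable names are illustrative, and the result includes any overhead between solves, not just the time spent inside the solver.

```julia
# Cumulative time (s) and cumulative number of solves after iterations 1 and 20,
# copied from the serial-mode log in this thread.
time_first, solves_first = 5.137349e+02, 2196
time_last, solves_last = 9.145367e+03, 43920

# Average wall-clock seconds per subproblem solve.
per_solve_first = time_first / solves_first
per_solve_overall = time_last / solves_last

println("first iteration:   ", round(1000 * per_solve_first; digits = 1), " ms per solve")
println("all 20 iterations: ", round(1000 * per_solve_overall; digits = 1), " ms per solve")
```

Both estimates come out a little above 0.2 seconds per solve, which is consistent with the question being asked here.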
I'm using Gurobi 9.1 and it's an LP. The full SDDP log is this:
------------------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson, 2017-21
Problem
Nodes : 36
State variables : 25
Scenarios : 1.03144e+64
Existing cuts : false
Subproblem structure : (min, max)
Variables : (21107, 21107)
GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
VariableRef in MOI.LessThan{Float64} : (1, 1)
GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64} : (5585, 5585)
VariableRef in MOI.GreaterThan{Float64} : (19568, 19568)
GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64} : (5050, 5051)
GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64} : (6506, 6516)
Options
Solver : serial mode
Risk measure : SDDP.Expectation()
Sampling scheme : SDDP.InSampleMonteCarlo
Numerical stability report
Non-zero Matrix range [8e-05, 1e+03]
Non-zero Objective range [9e-02, 1e+07]
Non-zero Bounds range [5e+05, 5e+05]
Non-zero RHS range [1e-01, 4e+04]
WARNING: numerical stability issues detected
- Matrix range contains small coefficients
Very large or small absolute values of coefficients
can cause numerical stability issues. Consider
reformulating the model.
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 4.379315e+07 6.238694e+05 5.137349e+02 1 2196
2 2.398887e+07 1.024725e+07 9.844802e+02 1 4392
3 1.365226e+07 1.025685e+07 1.464048e+03 1 6588
4 2.005606e+07 1.188252e+07 1.904386e+03 1 8784
5 1.751424e+07 1.188255e+07 2.471620e+03 1 10980
6 1.362879e+07 1.259155e+07 2.880270e+03 1 13176
7 1.396052e+07 1.261973e+07 3.248855e+03 1 15372
8 1.366801e+07 1.261982e+07 3.658474e+03 1 17568
9 1.412258e+07 1.262390e+07 4.027489e+03 1 19764
10 1.476644e+07 1.262392e+07 4.426796e+03 1 21960
11 1.387798e+07 1.298193e+07 4.769414e+03 1 24156
12 1.344110e+07 1.299375e+07 5.248231e+03 1 26352
13 1.444331e+07 1.299466e+07 5.842923e+03 1 28548
14 1.375559e+07 1.336322e+07 6.328565e+03 1 30744
15 1.335609e+07 1.336351e+07 6.852967e+03 1 32940
16 1.357505e+07 1.340325e+07 7.329363e+03 1 35136
17 1.329446e+07 1.340336e+07 7.866351e+03 1 37332
18 1.822702e+07 1.340367e+07 8.441247e+03 1 39528
19 1.401540e+07 1.340401e+07 8.839535e+03 1 41724
20 1.349358e+07 1.340825e+07 9.145367e+03 1 43920
Terminating training
Status : iteration_limit
Total time (s) : 9.145367e+03
Total solves : 43920
Best bound : 1.340825e+07
Simulation CI : 1.653153e+07 ± 3.064649e+06
------------------------------------------------------------------------------
I'm using the Gurobi parameter NumericFocus = 3. |
So a few things:
|
Thanks for the answer.
Also, I've tested the asynchronous mode with a model slightly modified from the one at the beginning (note the matrix range). I know that I have a better model in terms of range right now, but I'm running that model.
SDDP.jl (c) Oscar Dowson, 2017-21
Problem
Nodes : 36
State variables : 25
Scenarios : 1.03144e+64
Existing cuts : false
Subproblem structure : (min, max)
Variables : (21107, 21107)
GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
VariableRef in MOI.LessThan{Float64} : (1, 1)
GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64} : (5585, 5585)
VariableRef in MOI.GreaterThan{Float64} : (19568, 19568)
GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64} : (5050, 5051)
GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64} : (6506, 6516)
Options
Solver : Asynchronous mode with 4 workers.
Risk measure : SDDP.Expectation()
Sampling scheme : SDDP.InSampleMonteCarlo
Numerical stability report
Non-zero Matrix range [7e-04, 1e+04]
Non-zero Objective range [9e-02, 1e+06]
Non-zero Bounds range [5e+05, 5e+05]
Non-zero RHS range [1e-01, 4e+04]
No problems detected
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 5.423759e+07 8.267030e+05 4.416711e+02 1 2196
2 2.940131e+07 8.782821e+06 7.309163e+02 1 4392
3 5.217217e+07 8.782821e+06 7.387564e+02 2 4452
4 4.352296e+07 8.782821e+06 7.465079e+02 3 4512
5 5.251501e+07 8.782821e+06 7.541364e+02 5 4572
6 4.686811e+07 8.782821e+06 7.621236e+02 4 4632
7 1.719751e+07 1.037191e+07 1.096181e+03 1 6828
8 1.844196e+07 1.037192e+07 1.422216e+03 1 9024
9 1.547676e+07 1.118443e+07 1.725801e+03 1 11220
10 1.574095e+07 1.118467e+07 2.099916e+03 1 13416
11 1.661916e+07 1.118481e+07 2.460442e+03 1 15612
12 1.545823e+07 1.118489e+07 2.782426e+03 1 17808
13 1.357699e+07 1.271447e+07 3.092544e+03 1 20004
14 1.477563e+07 1.271460e+07 3.466616e+03 1 22200
15 1.953569e+07 1.271463e+07 3.803677e+03 1 24396
16 1.790396e+07 1.274343e+07 4.153817e+03 1 26592
17 1.501449e+07 1.274408e+07 4.489968e+03 1 28788
18 1.861698e+07 1.274434e+07 4.840596e+03 1 30984
19 1.214280e+07 1.284335e+07 5.198810e+03 1 33180
20 1.743427e+07 1.284566e+07 5.553337e+03 1 35376
21 1.358681e+07 1.287126e+07 5.906797e+03 1 37572
22 1.456575e+07 1.287127e+07 6.252890e+03 1 39768
23 1.869709e+07 1.298196e+07 6.575987e+03 1 41964
24 1.438251e+07 1.298244e+07 6.954527e+03 1 44160
25 1.394985e+07 1.298285e+07 7.338875e+03 1 46356
26 1.427660e+07 1.298287e+07 7.682910e+03 1 48552
27 1.350326e+07 1.298307e+07 8.120624e+03 1 50748
28 1.367676e+07 1.308422e+07 8.451969e+03 1 52944
29 1.396336e+07 1.308630e+07 8.860270e+03 1 55140
30 1.664874e+07 1.308645e+07 9.281819e+03 1 57336
Terminating training
Status : iteration_limit
Total time (s) : 9.281819e+03
Total solves : 57336
Best bound : 1.308645e+07
Simulation CI : 2.179678e+07 ± 4.737844e+06
------------------------------------------------------------------------------
I know that the run time is slow, but the strange thing is that it's not using the other workers again. Thanks for everything! |
It's probably something to do with the channels to the remote processes disconnecting because of the time it takes to receive. Unfortunately, I don't have the time to look into this in any detail. (What institution are you with?) Here's the majority of the parallel code. It'd take some digging to find the problem: SDDP.jl/src/plugins/parallel_schemes.jl Lines 170 to 264 in 5ec8a2e
|
Thanks for the answer. Thanks! |
Ah, so this is for the new Chilean model? Try the undocumented option use_master = false:
parallel_scheme = SDDP.Asynchronous(use_master = false) do m
    # Create one Gurobi environment and reuse it for every subproblem.
    env = Gurobi.Env()
    for node in values(m.nodes)
        set_optimizer(node.subproblem, () -> Gurobi.Optimizer(env))
        set_silent(node.subproblem)
    end
end
|
Yes, I'm working with the Chilean model.
------------------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson, 2017-21
Problem
Nodes : 36
State variables : 26
Scenarios : 1.03144e+64
Existing cuts : false
Subproblem structure : (min, max)
Variables : (21109, 21109)
GenericAffExpr{Float64,VariableRef} in MOI.GreaterThan{Float64} : (3352, 3354)
VariableRef in MOI.LessThan{Float64} : (1, 1)
GenericAffExpr{Float64,VariableRef} in MOI.Interval{Float64} : (5585, 5585)
VariableRef in MOI.GreaterThan{Float64} : (19568, 19568)
GenericAffExpr{Float64,VariableRef} in MOI.LessThan{Float64} : (5050, 5051)
GenericAffExpr{Float64,VariableRef} in MOI.EqualTo{Float64} : (6507, 6517)
Options
Solver : Asynchronous mode with 7 workers.
Risk measure : SDDP.Expectation()
Sampling scheme : SDDP.InSampleMonteCarlo
Numerical stability report
Non-zero Matrix range [1e-03, 1e+03]
Non-zero Objective range [9e-02, 1e+06]
Non-zero Bounds range [5e+05, 5e+05]
Non-zero RHS range [1e-02, 4e+04]
No problems detected
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 1.104024e+08 6.981477e+05 5.127468e+02 3 60
2 1.067315e+08 5.849218e+05 5.284406e+02 2 120
3 9.880152e+07 6.981528e+05 5.483199e+02 4 180
4 1.176214e+08 5.849218e+05 5.630537e+02 5 240
5 1.101718e+08 5.849218e+05 5.796307e+02 6 300
6 1.051674e+08 5.849218e+05 5.951814e+02 7 360
7 1.037979e+08 6.981528e+05 6.077584e+02 8 420
8 5.060291e+07 5.755976e+06 9.617168e+02 3 480
9 5.258060e+07 5.788610e+06 9.809755e+02 2 540
10 5.360229e+07 5.851016e+06 1.007858e+03 5 600
11 5.022065e+07 5.851016e+06 1.019422e+03 6 660
12 4.750682e+07 5.851016e+06 1.030757e+03 4 720
13 5.085912e+07 5.851016e+06 1.042366e+03 7 780
14 4.859930e+07 5.851016e+06 1.079381e+03 8 840
15 1.957667e+07 5.851016e+06 1.392243e+03 3 900
16 2.180245e+07 5.851016e+06 1.403444e+03 2 960
17 2.287774e+07 5.912278e+06 1.415303e+03 5 1020
18 1.647738e+07 5.912278e+06 1.426319e+03 6 1080
19 2.357789e+07 6.048069e+06 1.436571e+03 7 1140
20 2.668270e+07 6.048069e+06 1.447359e+03 4 1200
21 2.818930e+07 6.048069e+06 1.490012e+03 8 1260
22 1.360967e+07 6.048069e+06 1.774825e+03 3 1320
23 1.069678e+07 6.048069e+06 1.801416e+03 2 1380
24 1.635207e+07 7.168963e+06 1.819407e+03 6 1440
25 1.477921e+07 7.168963e+06 1.837739e+03 5 1500
26 2.053416e+07 7.168963e+06 1.850033e+03 7 1560
27 2.138778e+07 7.168963e+06 1.867305e+03 4 1620
28 1.942672e+07 7.168963e+06 1.909715e+03 8 1680
29 1.574227e+07 7.168963e+06 2.190866e+03 3 1740
30 1.549916e+07 7.168963e+06 2.224452e+03 2 1800
Terminating training
Status : iteration_limit
Total time (s) : 2.224452e+03
Total solves : 1800
Best bound : 7.168963e+06
Simulation CI : 4.712925e+07 ± 1.306993e+07
------------------------------------------------------------------------------
Thanks! |
Okay. I obviously have some scheduling issues switching between the serial and parallel modes.
Yes. This is true. You'll need more cuts to achieve the same bound compared with serial mode. |
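To put numbers on this trade-off, the sketch below compares the serial run with the 7-worker asynchronous run, using only the totals from the "Terminating training" summaries in the logs above. The field names and the helper seconds_per_iteration are illustrative. The two runs use slightly different models (25 versus 26 state variables), so the comparison is indicative only.

```julia
# Totals copied from the "Terminating training" summaries in this thread.
runs = [
    (label = "serial", iterations = 20, total_time = 9.145367e+03, bound = 1.340825e+07),
    (label = "asynchronous, 7 workers", iterations = 30, total_time = 2.224452e+03, bound = 7.168963e+06),
]

# Average wall-clock seconds per logged iteration.
seconds_per_iteration(r) = r.total_time / r.iterations

for r in runs
    println(r.label, ": ", round(seconds_per_iteration(r); digits = 1),
            " s per iteration, best bound ", r.bound)
end
```

The asynchronous run logs iterations several times faster, but its best bound after 30 iterations is still well below the serial bound after 20, which matches the point that more cuts are needed in parallel mode.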
I've been testing, and with Thanks! |
Okay, great that it's working. At some point I'll look into the master scheduling issue. Be very careful with how you measure convergence. If you have 26 state variables and 36 stages, you're likely going to need hundreds or thousands of iterations. Read #178. |
Actually, based on #178, I'm using SDDP.Statistical with a 2.5% confidence interval to define convergence of the model. With 3 extra workers, the model converges after about 50 or 55 training iterations, so I use 60 right now. By the way, it needs far fewer iterations when I use a reasonable lower bound for the objective function. |
That suggests the myopic policy (do the best now, ignore the future) is near optimal. I'd still try a run with a lot more iterations (like 500) and compare plots of the two policies. |
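For reference, here is a simple check of whether the reported bound falls inside the reported simulation confidence interval, applied to the three training logs posted in this thread. This is only a sketch with an illustrative helper, bound_in_ci; it is not SDDP.jl's stopping-rule implementation, and the numbers are copied from the log summaries above.

```julia
# (label, best bound, simulation mean, confidence-interval half-width),
# copied from the "Terminating training" summaries in this thread.
runs = [
    ("serial, 20 iterations", 1.340825e+07, 1.653153e+07, 3.064649e+06),
    ("asynchronous, 4 workers, 30 iterations", 1.308645e+07, 2.179678e+07, 4.737844e+06),
    ("asynchronous, 7 workers, 30 iterations", 7.168963e+06, 4.712925e+07, 1.306993e+07),
]

# True if the bound lies inside [sim_mean - half_width, sim_mean + half_width].
bound_in_ci(bound, sim_mean, half_width) =
    sim_mean - half_width <= bound <= sim_mean + half_width

for (label, bound, sim_mean, half_width) in runs
    # Relative gap between the simulated cost and the bound.
    gap = (sim_mean - bound) / bound
    println(label, ": bound in CI = ", bound_in_ci(bound, sim_mean, half_width),
            ", relative gap = ", round(100 * gap; digits = 1), "%")
end
```

In none of the three posted runs does the bound fall inside the interval, although the serial run misses only narrowly.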
Closing because there doesn't seem to be anything actionable here. I'm aware that the parallel scheme needs work, but I don't have any concrete plans to work on it. If, future reader, this is important for you, I'm available for paid consulting. |
Hi!
I'm trying to train my model with the Asynchronous mode, based on the docs.
In the first lines of the script, I'm running this:
Notice that I've added the statement "using Gurobi, SDDP" because I found it in another post.
Currently, I am creating my model this way:
When I try to train my model:
I can only see output from "worker 1", and nothing from the other workers. Also, at the end of the training iterations, I get the following error:
I can't figure out where the problem is.
Thanks in advance!