CP-SAT multi-thread crashing python w/ large number of workers #2019
Comments
Unfortunately, the current version of the CP-SAT solver doesn't work on Windows since version 7.5. Hopefully the problem will be fixed along with issues like #1918, which has fix version 7.7 and also mentions memory access errors.
Are you running multiple solvers in parallel?
Laurent Perron | Operations Research | lperron@google.com | (33) 1 42 68 53 00
Most of the time I don't execute multiple solvers in parallel. For about one or two weeks I used two instances of the solver with 8 search workers in one process (JVM) as a service to calculate different schedules, and I didn't find a problem. The number of calculated schedules was in the hundreds during one test. I used or-tools 7.4.
I changed my program so that no intermediate solvers run while building the main models. Now my solvers, and the model building, are strictly sequential. The input I'm testing with takes 157 solves to complete. With 100 workers it reaches around 34 solves on average before hitting the !is_present crash; 50 to 20 workers reach around 50 solves on average. Moving down to 8 workers it will make it above 140 solves, though it rarely completes all 157. Sometimes I get infeasible in the 140s, which doesn't seem correct when looking at the input the solver gets stuck on. I never encountered infeasible using 7.5 on Windows with this series of inputs.
Can you test the workaround for the crash: the catch_sigint_signal parameter?
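For reference, a minimal sketch of what setting this workaround looks like from Python (the toy model is illustrative only; num_search_workers and catch_sigint_signal are real SatParameters fields, as confirmed later in this thread):

```python
from ortools.sat.python import cp_model

# Illustrative toy model; the real models come from the reporter's program.
model = cp_model.CpModel()
x = model.NewIntVar(0, 10, "x")
model.Add(x >= 3)

solver = cp_model.CpSolver()
solver.parameters.num_search_workers = 100    # large worker count, as in the report
solver.parameters.catch_sigint_signal = True  # the suggested workaround
status = solver.Solve(model)
```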
And please use 7.6 or master. We have fixed a lot of presolve bugs since 7.4.
Can you save the model proto at each step and send me the crashing one? |
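One way to do that from Python (a sketch; save_model_proto is a hypothetical helper, and str() of the proto produces the text format used by the Output_Proto_*.txt files attached below):

```python
from ortools.sat.python import cp_model

def save_model_proto(model: cp_model.CpModel, path: str) -> None:
    # Dump the underlying CpModelProto in protobuf text format.
    with open(path, "w") as f:
        f.write(str(model.Proto()))

# For example, inside the solve loop:
# save_model_proto(model, f"Output_Proto_{batch}_{step}.txt")
```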
Using 7.6.7691 with Python on Windows 7, on a smaller machine (faster crashes), with 100 workers, setting solver.parameters.catch_sigint_signal = True seemed to make no difference to the Python crashes. I made sure the system wasn't hitting a memory wall. Windows100WithoutSigInt.zip is the series of model protos up to the crashing proto at 100 workers without SigInt; the crashing model proto is Output_Proto_10_74.txt. I will test, and get proto files, on Linux sometime today or tomorrow. Thanks!
On Oracle Linux 8.2, with or without the solver.parameters.catch_sigint_signal = True flag, I still get the !is_present crash. OracleLinux100WithoutSigInt.zip is the series of model protos up to the crashing proto at 100 workers without SigInt; the crashing model proto is Output_Proto_34_74.txt. OracleLinux100WithSigInt.zip is the series of model protos up to the crashing proto at 100 workers with SigInt; the crashing model proto is Output_Proto_51_74.txt. Crashing is still intermittent, with and without SigInt, but they're all !is_present crashes on Linux.
I just wanted to check in on this, to see if there are any other tests or files I could provide to help isolate the issue.
I do not know where the problem comes from:
1) are you crashing because you exhausted the memory?
2) is the model crashing individually, with the multiplication of concurrent solves merely increasing the likelihood of a crash?
3) are concurrent solves brittle, with a rare race condition?
So: for 1), monitor memory consumption; for 2), run all the solves sequentially to see if one fails; for 3), try to use 1 worker for each solve, but increase the number of concurrent solves (a sketch follows below).
My 2 cents.
Laurent Perron | Operations Research | lperron@google.com | (33) 1 42 68 53 00
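A minimal sketch of check 3) — many concurrent solves, each restricted to a single worker — assuming a hypothetical build_model() in place of the real model-building code:

```python
from concurrent.futures import ThreadPoolExecutor

from ortools.sat.python import cp_model

def solve_once(seed: int) -> int:
    model = build_model(seed)          # hypothetical stand-in for the real builder
    solver = cp_model.CpSolver()       # one solver instance per concurrent solve
    solver.parameters.num_search_workers = 1
    return solver.Solve(model)

# Run e.g. 32 solves at a time across the 157 inputs.
with ThreadPoolExecutor(max_workers=32) as pool:
    statuses = list(pool.map(solve_once, range(157)))
```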
I believe I just pushed the fix.
What version of OR-tools and what language are you using?
Version: v7.6.7691
Language: Python
Which solver are you using (e.g. CP-SAT, Routing Solver, GLOP, BOP, Gurobi)
CP-SAT
What operating system (Linux, Windows, ...) and version?
Windows 7 / Windows Server 2019
What did you do?
Found that my modeling program was causing intermittent Python crashes when using num_search_workers. I was able to aggravate the crashing by using a large number of workers (100).
What did you expect to see?
No Python crashes.
What did you see instead?
Consistent Python crashes.
My particular program has grown to be very large and dependent on an input file generated from a different program. I can provide the program and a sample input, but only if absolutely necessary, and by email, because my employer may try to claim that my implementation is proprietary.
The following two images are from following the trace output of the execution, with and without a solution callback. Something to note in these outputs is the "Total cuts added: ##" prints, which should only appear when logging progress is enabled, but I do not have it enabled. They have appeared during crashes both with and without the callback.
Going into the debugger in both of the above cases led to a memory access error at the following command in _pywrapsat.pyd.
The way my implementation works is by solving one job at a time in batches, using previous solutions in the next solve attempt to lock down variable domains over time. If I lower the number of workers, the crash takes longer to occur so I don't believe this is related to the size, or specifics, of the protobuf.
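A rough sketch of that batching pattern (the names here are illustrative, not from the actual program): each step receives a freshly built model, locks variables to values found by earlier solves, then records the new solution.

```python
from ortools.sat.python import cp_model

def solve_step(model, variables, locked):
    # `variables` maps name -> IntVar in this model;
    # `locked` maps name -> value fixed by earlier solves.
    for name, value in locked.items():
        model.Add(variables[name] == value)  # lock down the domain
    solver = cp_model.CpSolver()
    solver.parameters.num_search_workers = 8
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        locked.update({n: solver.Value(v) for n, v in variables.items()})
    return status
```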
I am not running solvers in parallel but I am running a solver while building the main model for the main solver.
I am not hitting a memory utilization wall either.
I would be happy to do more debugging on my end, I just need some pointers on where to look!
Thanks!