CP-SAT multi-thread crashing python w/ large number of workers #2019

Closed
androidrhodium opened this issue May 14, 2020 · 13 comments

@androidrhodium

What version of OR-tools and what language are you using?
Version: v7.6.7691
Language: Python

Which solver are you using (e.g. CP-SAT, Routing Solver, GLOP, BOP, Gurobi)
CP-SAT

What operating system (Linux, Windows, ...) and version?
Windows 7 / Windows Server 2019

What did you do?
I found that my modeling program was intermittently crashing Python when using num_search_workers. I was able to aggravate the crashing by using a large number of workers (100).

What did you expect to see
No python crashing.

What did you see instead?
Consistent python crashing.

My particular program has grown to be very large and dependent on an input file generated from a different program. I can provide the program and a sample input, but only if absolutely necessary, and by email, because my employer may try to claim that my implementation is proprietary.

The following two images show the trace output of the execution, with and without a solution callback. Note the "Total cuts added: ##" lines in these outputs: they are normally printed when progress logging is enabled, but I do not have it enabled. They have appeared during crashes both with and without the callback.

[Image: Threading execution error]
[Image: Threading execution error wo callback]

Debugging both of the above cases led to a memory access error at the following instruction in _pywrapsat.pyd:

[Image: Threading execution _pywrapsat binary instruction error location]

My implementation solves one job at a time in batches, using previous solutions in the next solve attempt to lock down variable domains over time. If I lower the number of workers, the crash takes longer to occur, so I don't believe this is related to the size, or specifics, of the protobuf.

I am not running solvers in parallel but I am running a solver while building the main model for the main solver.

I am not hitting a memory utilization wall either.

I would be happy to do more debugging on my end, I just need some pointers on where to look!

Thanks!

Mizux added this to To do in ToDo via automation May 15, 2020
Mizux added the Bug, Lang: Python, OS: Windows, and Solver: CP-SAT Solver labels May 15, 2020

@gregy4

gregy4 commented May 15, 2020

Unfortunately, the current version of the CP-SAT solver hasn't worked on Windows since version 7.5. Hopefully the problem will be solved along with issues like #1918, which has fix version 7.7 and also mentions memory access errors.

@lperron
Collaborator

lperron commented May 15, 2020 via email

@gregy4

gregy4 commented May 15, 2020

Most of the time I don’t execute multiple solvers in parallel. For about one or two weeks I used two solver instances with 8 search workers in one process (JVM) as a service to calculate different schedules, and I didn’t find a problem. The number of calculated schedules was in the hundreds during one test. I used OR-Tools 7.4.

@androidrhodium
Author

I am also not running solvers in parallel. I set up one large model, during which I set up and solve smaller models, and then I solve the larger model.

I spun up a small Oracle Linux box and at 100 workers I get this:
[Image]
I saw this !is_present failure on occasion while trying to figure out the Windows Python crashes, but resolved it by reducing the number of variables that make up my objective. There are no machine resource issues.

Lowering the number of workers to 75 results in the same crash behavior. Down to 25 workers I still see !is_present crashes. I'm going to reduce all the little solves I do between and during the setup of the larger solves.

@androidrhodium
Author

I changed my program so that no intermediate solvers run while the main models are being built. Now my solving and model building are strictly sequential. The input I'm testing with takes 157 solves to complete. With 100 workers it reaches around 34 solves on average before hitting the !is_present crash; with 20 to 50 workers it reaches around 50 solves on average. With 8 workers it makes it above 140 solves, though it rarely completes all 157. Sometimes I get infeasible in the 140s, which doesn't seem correct when looking at the input the solver gets stuck on. I never encountered infeasible using 7.5 on Windows with this series of inputs.

@lperron
Collaborator

lperron commented May 20, 2020

Can you test the workaround for the crash: parameter catch_sigint_signal = true?
It fixes #1918.
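
For reference, a minimal sketch of how the parameters mentioned in this thread could be set from Python. The helper name `apply_thread_settings` is hypothetical; `num_search_workers` and `catch_sigint_signal` are the parameter names used in this thread, and `solver` is assumed to be a `cp_model.CpSolver` (or any object exposing the same `parameters` attribute).

```python
def apply_thread_settings(solver, num_workers=8, catch_sigint=True):
    """Set the two CP-SAT parameters discussed in this thread on `solver`.

    Hypothetical helper: `solver` is assumed to be an
    ortools.sat.python.cp_model.CpSolver, but any object with a
    `parameters` attribute works.
    """
    # Number of parallel search workers (the crashes were reported with 100).
    solver.parameters.num_search_workers = num_workers
    # Workaround suggested in this thread.
    solver.parameters.catch_sigint_signal = catch_sigint
    return solver
```

With OR-Tools installed, this would presumably be called as `apply_thread_settings(cp_model.CpSolver(), num_workers=100)` before `solver.Solve(model)`.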

@lperron
Collaborator

lperron commented May 20, 2020

And please use 7.6 or master. We have fixed a lot of presolve bugs since 7.4.

@lperron
Collaborator

lperron commented May 20, 2020

Can you save the model proto at each step and send me the crashing one?
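
One way this could be done, as a rough sketch: dump the text form of the model proto to a numbered file just before each solve, so the last file written is the one that crashed. The helper name `save_model_proto`, the directory name, and the file naming scheme are all hypothetical; the sketch assumes `model` is a `cp_model.CpModel` whose `Proto()` result prints as protobuf text format.

```python
from pathlib import Path


def save_model_proto(model, step, out_dir="model_protos"):
    """Write the text form of model.Proto() to a per-step file; return its path.

    Hypothetical helper: `model` is assumed to be a cp_model.CpModel, and the
    file naming scheme is only an illustration.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"model_step_{step:04d}.txt"
    # str() on a protobuf message is assumed to yield its text format.
    path.write_text(str(model.Proto()))
    return path
```

Calling `save_model_proto(model, step)` immediately before each `solver.Solve(model)` means that, after a crash, the highest-numbered file on disk corresponds to the model that was being solved.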

@androidrhodium
Author

Using 7.6.7691 with Python on Windows 7, on a smaller machine (faster crashes) with 100 workers, setting solver.parameters.catch_sigint_signal = True seemed to make no difference to the Python crashes. I made sure the system wasn't hitting a memory wall.

Windows100WithoutSigInt.zip This is the series of model protos up to the crashing proto at 100 workers without SigInt. Crashing model proto is Output_Proto_10_74.txt
Windows100WithSigInt.zip This is the series of model protos up to the crashing proto at 100 workers with SigInt. Crashing model proto is Output_Proto_7_74.txt

I will test, and get proto files, on Linux sometime today or tomorrow.

Thanks!

@androidrhodium
Author

On Oracle Linux 8.2, with or without the solver.parameters.catch_sigint_signal = True flag I still get !is_present crashing.

OracleLinux100WithoutSigInt.zip This is the series of model protos up to the crashing proto at 100 workers without SigInt. Crashing model proto is Output_Proto_34_74.txt

OracleLinux100WithSigInt.zip This is the series of model protos up to the crashing proto at 100 workers with SigInt. Crashing model proto is Output_Proto_51_74.txt

Crashing is still intermittent, with and without SigInt, but they're all !is_present crashes on Linux.

@androidrhodium
Author

I just wanted to check in on this and see whether there are any other tests or files I could provide to help isolate the issue.

@lperron
Collaborator

lperron commented Jun 3, 2020 via email

@lperron
Collaborator

lperron commented Aug 20, 2020

I believe I just pushed the fix

lperron closed this as completed Aug 20, 2020
ToDo automation moved this from To do to Done Aug 20, 2020
Mizux added this to the v8.0 milestone Sep 30, 2020