Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in code builders with SCons 3.0.2+ on Windows #24846

Closed
akien-mga opened this issue Jan 8, 2019 · 11 comments
Closed

Race condition in code builders with SCons 3.0.2+ on Windows #24846

akien-mga opened this issue Jan 8, 2019 · 11 comments

Comments

@akien-mga
Copy link
Member

Godot version:
Master (e4b0251)

OS/device including version:
Windows (10 I suppose, only checked on AppVeyor CI)
Python 2.7.15.
SCons 3.0.2 or 3.0.3.

Issue description:
I tried the newly released 3.0.2 (then yanked due to a packaging issue) and 3.0.3 in #24837 on AppVeyor CI, and it fails (so far in reproducible fashion, on 4-5 attempts) with:

build_gdnative_wrapper_code(["modules\gdnative\gdnative_wrapper_code.gen.cpp"], ["modules\gdnative\gdnative_api.json"])
Executing builder function in subprocess: module_path='C:\\projects\\godot\\modules\\gdnative\\gdnative_builders.py', parameter_file='C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\386590e732354246983eb9a7c1278d1d.json', parameter_file_size=16106L, target=['C:\\projects\\godot\\modules\\gdnative\\gdnative_wrapper_code.gen.cpp'], source=['C:\\projects\\godot\\modules\\gdnative\\gdnative_api.json']
cl /Fomodules\gdnative\gdnative_wrapper_code.gen.windows.opt.tools.64.obj /c modules\gdnative\gdnative_wrapper_code.gen.cpp /TP /nologo /O2 /MT /Gd /GR /W3 /wd4267 /wd4244 /wd4305 /wd4018 /wd4800 /EHsc -DZSTD_STATIC_LINKING_ONLY -DFREETYPE_ENABLED -DGLAD_ENABLED -DGLES_OVER_GL /DDEBUG_ENABLED /DWINDOWS_ENABLED /DOPENGL_ENABLED /DRTAUDIO_ENABLED /DWASAPI_ENABLED /DWINMIDI_ENABLED /DTYPED_METHOD_BIND /DWIN32 /DMSVC /DWINVER=0x0601 /D_WIN32_WINNT=0x0601 /DNOMINMAX /D_WIN64 /DPTRCALL_ENABLED /DTOOLS_ENABLED /DGDSCRIPT_ENABLED /DMINIZIP_ENABLED /Ieditor /I. /Iplatform\windows "/IC:\Program Files (x86)\Windows Kits\10\Include" "/IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "/IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "/IC:\Program Files (x86)\Windows Kits\10\Include\10.0.14393.0\ucrt" "/IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "/IC:\Program Files (x86)\Windows Kits\10\Include\10.0.14393.0\shared" "/IC:\Program Files (x86)\Windows Kits\10\Include\10.0.14393.0\um" "/IC:\Program Files (x86)\Windows Kits\10\Include\10.0.14393.0\winrt" /Ithirdparty\zlib /Ithirdparty\zstd /Ithirdparty\rtaudio /Ithirdparty\glad /Ithirdparty\libpng /Ithirdparty\freetype\include /Imodules\gdnative\include
gdnative_wrapper_code.gen.cpp
modules\gdnative\gdnative_wrapper_code.gen.cpp(8): fatal error C1083: Cannot open include file: 'gdnative_api_struct.gen.h': No such file or directory

AppVeyor build: https://ci.appveyor.com/project/akien-mga/godot/builds/21459795

It still runs fine with SCons 3.0.1. I've reported this to SCons maintainers who mention that such race conditions on Windows + Python are a known issue (as we've experienced too in the past), and that 3.0.2+ involved some performance improvements that could exacerbate this issue in our case (it might not have failed in 3.0.1 because it was slow enough to release the file handle in time).

As mentioned in #17595 (comment) already and confirmed today on IRC, SCons' @bdbaddog advises that we move our code builders to their own standalone Python scripts and execute them directly via a shell (so python /path/to/script.py).

Excerpts from IRC discussion:

15:40 <@bdbaddog> re race. That's a known windows problems. We have some workarounds but no fix as yet. It seems likely it's an issue with the fact that python  releases the GIL around closing files.. (windows takes longer to close files than other OS's it seems)
15:44 <@bdbaddog> The race condition is why I advocate for only simply python logic in process in scons. move the generator to a script and then a) your issue should go away, b) you can run it in paralell (running in process holds the GIL..)
[...]
16:44 <Akien> bdbaddog: When you say "only simple python logic in process", you mean that we should move this logic to a non-Python script?
16:45 <@bdbaddog> no. I'm sugguesting rather than have a ptyhon based builder in scons which does soemthing non-trivial that you put it in a separate python script that scons would then run via a "normal' action via a shell
[...]
16:48 <Akien> We actually had a change somewhat like that a few months ago which fix the race conditions we were having on Windows: https://github.com/godotengine/godot/pull/17595
[...]
16:49 <Akien> If we run the scripts directly via a shell, how do we handle the dependencies in SCons?
16:50 <Akien> Will scons wait for the shell to exit before continuing?
16:50 <@bdbaddog> how do you handle it now? scanner? emitter?
16:50 <@bdbaddog> yes will wait for shell before continuing
16:51 <@bdbaddog> looks like whatever logic is giving you the issue wasn't handled by that pull request?
16:51 <Akien> Well it worked great until SCons 3.0.2 :P
16:52 <Akien> That PR moved the running of those generators to a platform-specific `run_in_subprocess`: https://github.com/godotengine/godot/blob/master/platform_methods.py#L16
16:52 <@bdbaddog> there's some speedups in 3.0.2.. so such improvements tend to exacerbate the windows file close issues..
16:53 <@bdbaddog> ugh.. you're calling subprocess in a builder instead of just using scons' normal builder/action mechanism..
16:54 <Akien> Well I think that's what this PR changed, we used to use SCons Builder but couldn't fix the Windows issues
16:54 <@bdbaddog> you were using builder with a python action,  instead of a normal shell action.
16:54 <@bdbaddog> that was most likely your issue.
16:55 <@bdbaddog> from my comment: Rather than moving your python generator functions such that they can be run separately and then running them via a builder which calls subprocess, wouldn't it be simpler to just have your builders call those python scripts via normal builders? Since the issue (if I understand correctly) was that you had blobs of python logic in your builders in the same python process and GIL issues, and you've resolve that by making the
16:55 <@bdbaddog>  generators independent scripts.
16:55 <@bdbaddog> So your Action would now be "python <generator_script.py> "? This would run the python scripts in a separate process
[...]
17:00 <@bdbaddog> It's easy to get carried away using python in scons for builders and such. but on windows (and occasionally linux/mac) such usage can bite you.  In general move the "doing" out of scons, but the what to do, etc inside.  That's where python is most useful. 
17:00 <@bdbaddog> Plus this lets you more easily test your generator logic.

CC @viktor-ferenczi @hpvb

@akien-mga
Copy link
Member Author

CC @SeleckyErik

@rxlecky
Copy link
Contributor

rxlecky commented Feb 3, 2019

I don't seem to be able to replicate this locally with Python 3.7.0 and SCons 3.0.3

@bdbaddog
Copy link

bdbaddog commented Feb 3, 2019

I don't seem to be able to replicate this locally with Python 3.7.0 and SCons 3.0.3

In my testing with similar issues, I've found if you don't have a reasonably fast system and storage it's less likely to show up. The issue (at least the working theory) is that Python releases the GIL around file open and closes and due to (seemingly) windows filesystems taking longer to do so than non-windows the files are being opened by other processes which have them locked before they actually close on the filesystem. As you might guess nailing this down exactly has been daunting.

That said I believe in python 3.5+ file open's are supposed to explicitly set flags so threads don't share file handles. So it could be that this improvement could be resolving this long standing issue.

Though in general my advice has long been don't do anything too significant in process in scons. You'll hold the GIL regardless and reduce your builds ability to parallelize.

@rxlecky
Copy link
Contributor

rxlecky commented Feb 3, 2019

I wanted to test the solution that you sugessted, i. e. running the generator in a separate process, but I have no way of testing whther that fixed it.

Wait, what you said is that by running it in a separate process will make it easier for SCons to parallelize the build? I thought that this will make the calling thread stall and wait till the spawned process finishes and returns?

@bdbaddog
Copy link

bdbaddog commented Feb 3, 2019 via email

@rxlecky
Copy link
Contributor

rxlecky commented Feb 4, 2019

Indeed. I totally understand why you'd want that. I was just wondering what you meant by better parallelization opportunity; maybe there's just something I overlooked :)

@bdbaddog
Copy link

bdbaddog commented Feb 4, 2019

Indeed. I totally understand why you'd want that. I was just wondering what you meant by better parallelization opportunity; maybe there's just something I overlooked :)

Are you talking about this sentence:

Though in general my advice has long been don't do anything too significant in process in scons. You'll hold the GIL regardless and reduce your builds ability to parallelize.

Actions in SCons come in several flavors (actions are the things that get run to do your build steps). One of which is Python (general a class or function.. you can see more in the manpage). When you do too much in a python action like any python which is running in a thread, it's going to hold the GIL and stop all other python from running until something releases the GIL. So if you did a bunch off file processing/templating/etc in a python action in scons it wouldn't drop the GIL until it some IO call most likely. Anyway that's what I'm talking about. You want that stuff to happen when the GIL is not being held. The simplest way to do that is to have the command run in a shell. with a normal action. So you move the logic into a standalone script and pass the arguments to it. Which while not terribly intuitive if you have a bunch of cores can be faster. (While also avoiding these pesky race conditions which seem to almost exclusively show up on windows)

@rxlecky
Copy link
Contributor

rxlecky commented Feb 4, 2019

Oh, alright, tahat makes sense. Thank you for explanation!

@rxlecky
Copy link
Contributor

rxlecky commented Feb 4, 2019

Also, you were most likely right that the reson why I couldn't replicate this was that my setup is not fast enough. I set up AppVeyor for my fork of Godot and the issue immediately emerged. Thanks for all the help!

@bdbaddog
Copy link

bdbaddog commented Feb 4, 2019

@SeleckyErik - No problemo. We're here to help. Unfortunately fixing this race condition in SCons is tricky bit and moving the cause out seems generally a better solution when it can be done. So with a reasonable workaround it may take a while to dedicate (likely a week) to nail it down, if it turns out that it can be fixed in SCons and not require changes in Python..

@akien-mga akien-mga modified the milestones: 3.2, 4.0 Jan 9, 2020
@akien-mga akien-mga self-assigned this Jan 9, 2020
@akien-mga akien-mga modified the milestones: 4.0, 3.2 Jan 9, 2020
@akien-mga
Copy link
Member Author

Seems to work in the current master branch (as per #34717).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants