cx_Freeze with torch.multiprocessing using wrong source in child processes #2376
Re-opening, as the fix I previously posted doesn't actually work (unless the source is present in the launch folder). |
That path information is only for debugging; it can be changed with replace_paths. The real bug, however, must be the use of multiprocessing. With the stdlib's multiprocessing we need to call freeze_support, but torch.multiprocessing does not expose this function, so a workaround needs to be analyzed. |
I tried freeze_support(), which works/is needed on Windows, but not Linux. I'm not sure what paths I would replace. It appears to look for the Python files of the app in the folder that the executable is run from, and throws an error that it can't find them, e.g. running from the build folder: FileNotFoundError: [Errno 2] No such file or directory: '/some/folder/build/exe.linux-x86_64-3.11/Minimal.py' If you run it from another location, it complains they are not in that location (always with the full path of that location). D. |
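For reference, the stdlib pattern under discussion looks like this — a minimal sketch using plain multiprocessing, not the torch reproducer from this issue:

```python
from multiprocessing import Pool, freeze_support

def square(x):
    return x * x

if __name__ == "__main__":
    # In a frozen executable, freeze_support() must be the first call under
    # the main guard: when a re-launched child detects the special
    # multiprocessing arguments, it hands control back to the worker
    # machinery instead of re-running the script body.
    freeze_support()
    with Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))
```

On POSIX with the default fork start method this call is a no-op, which matches the observation above that it helps on Windows but not Linux.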
Can you test with cx_Freeze 7.0 and with the dev release? You can test with the latest development build: |
There's still an issue with finding the source: |
From what I understand you are using conda for Linux. What command did you use to install this specific version of Torch? |
Actually I set up the environment with conda, but used pip to install the modules as I couldn't get the versions I needed with conda. I think the command was: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
Using your Minimal.py and the command line:

cxfreeze --script Minimal.py build_exe --replace-paths '*='

This will be available in cx_Freeze 7.1.0.dev16. |
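For context, replace_paths works by rewriting the source path stored in each compiled code object. A rough pure-Python sketch of the idea (an illustration of the mechanism, not cx_Freeze's actual implementation):

```python
import types

def replace_code_path(code: types.CodeType, new_path: str) -> types.CodeType:
    # CodeType.replace (Python 3.8+) returns a copy of the code object with
    # the given fields substituted; co_filename is what tracebacks display.
    return code.replace(co_filename=new_path)

# A build-machine path compiled into a module...
code = compile("x = 1\n", "/my/home/folder/minimal_bug/Minimal.py", "exec")
# ...can be stripped, which is roughly what --replace-paths '*=' asks for.
stripped = replace_code_path(code, "Minimal.py")
print(stripped.co_filename)  # Minimal.py
```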
https://cx-freeze--2382.org.readthedocs.build/en/2382/faq.html#multiprocessing-support You can test the patch in the latest development build: |
So I did: cxfreeze --script Minimal.py build_exe --replace-paths '*=' And, I now get the error: FileNotFoundError: [Errno 2] No such file or directory: '/my/home/folder/minimal_bug/build/exe.linux-x86_64-3.11/=/Minimal.py' |
Please check if you have cx_Freeze 7.1.0.dev16 with: |
Actually it was cxfreeze 7.1.0-dev15. Not sure how that happened, as I followed your instructions. I just tried it again and now I have 7.1.0-dev16. However, same output: FileNotFoundError: [Errno 2] No such file or directory: '/my/home/folder/minimal_bug/build/exe.linux-x86_64-3.11/=/Minimal.py' |
Uninstall cx_Freeze and reinstall. Are you using the pip or conda version? Probably some conflict. |
I was using pip. I uninstalled and re-installed via pip: same error. I then tried uninstalling via pip and installing via conda, and I get: $ conda install -y --no-channel-priority -S -c https://marcelotduarte.github.io/packages/conda cx_Freeze UnavailableInvalidChannel: HTTP 404 NOT FOUND for channel packages/conda https://marcelotduarte.github.io/packages/conda The channel is not accessible or is invalid. You will need to adjust your conda configuration to proceed. |
Initially, I did two tests; if you can do the same, it would help eliminate any bugs. I created one new environment using the system Python and another using conda. If you test this second option the way I tested it, that's already good.
Note that I used the CPU version; please use that too. Then I will test using CUDA. |
In the meantime I re-installed using conda by doing: wget https://marcelotduarte.github.io/packages/conda/linux-64/cx_freeze-7.1.0.dev16-py311h459d7ec_0.conda I also got the same error. Note: The whole point of torch.multiprocessing is to use multiple GPUs, so it working just on CPU isn't that useful. I'll try to create an entirely new environment from scratch with conda and see if it works... |
I created an entirely new environment with just cx_Freeze and torch (GPU version) and hit the same issue. This is my history: conda create --name cxtest python=3.11 Output (note: ever so slightly different from before, as I'm now getting a SIGTERM that I didn't get before, though it's the same missing-file error): $ ./Minimal |
The conda version has a bug, I'll try to solve it.
I had told you to use replace_paths exactly to remove the complete path information in the traceback, but I see that it now causes the (previous) error or the SIGTERM. I'll investigate it. But, using only: |
Using (cxtest) $ ./Minimal |
|
I can't see a difference! (cxtest) $ python -VV cx_Freeze 7.1.0.dev16 |
Are you sure you don't have the source files in the folder you are running the executable from? It's the only thing I can think of. |
Now, I understand the situation. |
This version just hangs. You run the program and it outputs absolutely nothing to the screen, and doesn't return. EDIT: If you leave it long enough, it does actually run OK. I'm timing it now to see how long, but it was more than a few minutes. History: |
I changed the code a bit to check:

```python
import torch

def per_device_launch_fn(current_gpu_index, num_gpu):
    for i in range(1, 1000):
        print("Train...")

num_gpu = 4

if __name__ == "__main__":
    print("Starting multiprocessing:", num_gpu, __file__)
    torch.multiprocessing.start_processes(
        per_device_launch_fn,
        args=(num_gpu,),
        nprocs=num_gpu,
        join=True,
        start_method="spawn",
    )
```

$ time python Minimal.py
$ cxfreeze --script Minimal.py build_exe
In the next run, the time is similar to the time used by the python command:
And next time too:
But, using to build:
|
I think the timing thing may have been system related (it's a shared computer): one run took 1.5 hours last week, but today it's not taking that long.

One other issue I did notice is that in the frozen version, `if __name__ == "__main__":` is true in every sub-process, resulting in that code being called N times, whereas in the python version it's only called once. This doesn't matter in the minimal example (the training loop is called 4 times with different values of current_gpu_index), but in my real program the logic in main() is a bit more complex, as it checks sys.argv in the main process, which results in different behaviour between the python and frozen versions. I may be able to re-write the code to get round this, but it does strike me as a bug: presumably torch.multiprocessing must be doing something to ensure per_device_launch_fn() is called directly in the python version, whereas in the frozen version it is being reached via main(). I'm doing some testing to see if this is significant.

Edit: Child processes seem to be called with the following (additional*?) arguments: --multiprocessing-fork tracker_fd=XX pipe_handle=YY where XX is the same for all children, and YY is different for each child. I'm assuming that in the python version the torch.multiprocessing code reads these and puts sys.argv back how you might expect. [* My program has no arguments, so it's not clear if they are additional or replacements.] |
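Based on the arguments observed above, one way to keep argv-dependent logic out of the children is a guard like the following (is_multiprocessing_child is a hypothetical helper for illustration, not part of cx_Freeze or torch):

```python
import sys

def is_multiprocessing_child(argv):
    # Frozen child processes were observed to receive arguments of the form:
    #   --multiprocessing-fork tracker_fd=XX pipe_handle=YY
    return any(a.startswith("--multiprocessing-fork") for a in argv)

if __name__ == "__main__":
    if not is_multiprocessing_child(sys.argv):
        # Parent-only logic, e.g. parsing sys.argv, goes here.
        pass
```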
The hook that I used to patch multiprocessing is based on #264, and later I discovered a similar patch (#501 (comment)); there is even an open upstream issue, python/cpython#104607. |
I don't think it's very different, see how the spawn is described. |
Update: simply doing this works around it:
To be clear, this is not necessary in the python version. |
Release 7.1.0 is out! I'll continue to work on pytorch hook to optimize it. |
Based on information from you and others, I improved the hook for multiprocessing.

```python
import torch
from multiprocessing import freeze_support

def per_device_launch_fn(current_gpu_index, num_gpu):
    for i in range(1, 1000):
        print("Train...")

num_gpu = 4

if __name__ == "__main__":
    freeze_support()
    print("Starting multiprocessing:", num_gpu, __file__)
    torch.multiprocessing.start_processes(
        per_device_launch_fn,
        args=(num_gpu,),
        nprocs=num_gpu,
        join=True,
        start_method="spawn",
    )
```
|
Release 7.1.1 is out! |
Prerequisite
Describe the bug
On Linux, when I use cx_Freeze with a Python script that uses torch.multiprocessing to create multiple processes (which essentially wraps the stdlib multiprocessing), the child processes seem to try to use the original Python files (for the program) and the original Python environment (for the modules), not the ones in the build directory. The initial result of this is errors about the program's source .py files not being found. Other errors can occur if the source is copied into the build folder.
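The likely mechanism: with the spawn start method, each child starts a fresh interpreter and re-imports the parent's main module by its original file path, which no longer exists in a frozen build. A minimal stdlib illustration of the pattern (plain multiprocessing standing in for torch.multiprocessing):

```python
import multiprocessing as mp

def worker(i):
    print("child", i)

if __name__ == "__main__":
    # "spawn" serializes enough state for each child to re-import __main__
    # from its file path; in a frozen build that path (e.g. .../Minimal.py)
    # is absent, consistent with the FileNotFoundError reported in this issue.
    mp.set_start_method("spawn", force=True)
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```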
To Reproduce
Environment is Linux, Python 3.11, PyTorch v2.2.2+cu121. [Note: this problem does not occur on Windows.]
Minimal source (Minimal.py):
build script is
Expected behavior
I would expect the pyc versions of code in the build folder to be used under all circumstances (even by child processes), not the original ones.