Encountering FileNotFoundError while Compiling Triton Kernel in Distributed Training #2688
Comments
+1 with this issue
+1
A fix would be to write a customized cache manager. In this manager, only rank 0 writes the files, and all ranks then wait at a barrier, e.g.:
In my case, it works fine (you must ensure the code path is the same on all ranks). UPD (2024.04.07): I think this problem is fixed after #3544, so perhaps you don't need a customized manager anymore.
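The rank-0-only idea in the comment above can be sketched without touching Triton at all. The following is a minimal illustration, not Triton's cache manager API: the class name `RankZeroFileCache`, its `put` method, and the `barrier` argument are all hypothetical, and threads stand in for ranks so the example runs on one machine. In a real job the barrier would be something like `torch.distributed.barrier`.

```python
import os
import tempfile
import threading


class RankZeroFileCache:
    """Hypothetical cache: only rank 0 writes, every rank waits, then reads."""

    def __init__(self, cache_dir, rank, barrier):
        self.cache_dir = cache_dir
        self.rank = rank
        self.barrier = barrier  # callable that blocks until all ranks arrive
        os.makedirs(cache_dir, exist_ok=True)

    def put(self, data, filename):
        path = os.path.join(self.cache_dir, filename)
        if self.rank == 0:
            # Write to a private temp file, then move it into place.
            fd, tmp = tempfile.mkstemp(dir=self.cache_dir, prefix=filename + ".tmp.")
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        # Every rank must reach this call, otherwise the others wait forever.
        self.barrier()
        return path


def run_demo(world_size=4):
    """Simulate `world_size` ranks with threads sharing one cache directory."""
    cache_dir = tempfile.mkdtemp()
    barrier = threading.Barrier(world_size)
    results = [None] * world_size

    def worker(rank):
        cache = RankZeroFileCache(cache_dir, rank, barrier.wait)
        path = cache.put(b"compiled-kernel", "kernel.ttir")
        with open(path, "rb") as f:
            results[rank] = f.read()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, sorted(os.listdir(cache_dir))
```

Because every rank calls the barrier inside `put`, a rank that skips the call (a different code path) would leave the others blocked, which is why the code path has to be the same on all ranks.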
@HIT-cwh Hi, I hit the same issue and resolved it by setting […]. The root cause in my case is a multi-node cluster with shared storage. There are coincidentally processes with the same […]. As a result, two processes try to […]. @LyricZhao Is there any plan to support multi-node clusters with shared storage, without a handcrafted […]?
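To make the shared-storage collision concrete: the reported temp path ends in `.tmp.pid_15735_304289`, which suggests the temp name is derived from the process ID plus another number. Assuming that name is not guaranteed to be unique across nodes, two processes on different machines could pick the same temp path, and whichever calls `os.replace` second would find the source already moved and raise `FileNotFoundError`. The sketch below shows one way to avoid that; `atomic_write` is a hypothetical helper for illustration, not Triton code.

```python
import os
import socket
import uuid


def atomic_write(filepath, data):
    """Write `data` to `filepath` via a uniquely named temp file.

    Returns the temp path that was used, for inspection only.
    """
    # Hostname + PID + random suffix: stays unique even when two nodes
    # on the same shared filesystem happen to reuse a PID.
    suffix = f"{socket.gethostname()}.{os.getpid()}.{uuid.uuid4().hex}"
    temp_path = f"{filepath}.tmp.{suffix}"
    with open(temp_path, "wb") as f:
        f.write(data)
    # os.replace swaps the destination atomically on one filesystem, but it
    # still fails if temp_path was already moved away by another process.
    os.replace(temp_path, filepath)
    return temp_path
```

The atomicity of `os.replace` protects readers of the destination file; it does not protect against two writers sharing the same source path, which is the case the unique suffix is meant to rule out.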
Hi, I guess this PR will help, #3544. BTW, I think […]
During the process of distributed training, I encountered the following problem when compiling Triton kernels:
The above error only occurs during distributed training (multi-process), and both '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir.tmp.pid_15735_304289' and '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir' files do exist.
Given that the intermediate results across different processes are identical, I attempted to replace:
with:
This tweak squashed the error, but it's not cool.
I would appreciate it if anyone could explain why this issue arises. After all, os.replace(temp_path, filepath) should be an atomic operation.
Here is my system environment: