
Encountering FileNotFoundError while Compiling Triton Kernel in Distributed Training #2688

Open
HIT-cwh opened this issue Nov 21, 2023 · 5 comments

Comments

HIT-cwh commented Nov 21, 2023

During distributed training, I encountered the following error while compiling Triton kernels:

Traceback (most recent call last):
......
File "/mnt/petrelfs/caoweihan/anaconda3/envs/deepspeed/lib/python3.10/site-packages/triton/compiler/compiler.py", line 482, in compile
  metadata_group[ir_filename] = fn_cache_manager.put(next_module, ir_filename)
File "/mnt/petrelfs/caoweihan/anaconda3/envs/deepspeed/lib/python3.10/site-packages/triton/runtime/cache.py", line 109, in put
  os.replace(temp_path, filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir.tmp.pid_15735_304289' -> '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir'

The above error occurs only during distributed (multi-process) training, and yet both '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir.tmp.pid_15735_304289' and '/mnt/petrelfs/caoweihan/.triton/cache/cff628804055ab05f902072733c9ab2d/_rms_norm_bwd_dx_fused.ttir' do exist on disk.

Given that the intermediate results across different processes are identical, I attempted to replace:

# copied from https://github.com/openai/triton/blob/main/python/triton/runtime/cache.py#L129
os.replace(temp_path, filepath)

with:

try:
    os.replace(temp_path, filepath)
except FileNotFoundError:
    # Another process has already moved the temp file into place.
    pass

This tweak suppresses the error, but it is clearly not a proper fix.

I would appreciate it if anyone could explain why this issue arises. After all, os.replace(temp_path, filepath) is supposed to be an atomic operation.
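
For reference, here is a minimal, standalone sketch (file names are illustrative, not Triton's) that reproduces the same FileNotFoundError when two processes race to os.replace the same temporary source file:

import multiprocessing as mp
import os

def replace_worker(temp_path, filepath):
    try:
        # Atomic with respect to `filepath`, but raises FileNotFoundError
        # if `temp_path` was already moved away by another process.
        os.replace(temp_path, filepath)
    except FileNotFoundError as e:
        print(f"pid {os.getpid()}: {e}")

if __name__ == "__main__":
    temp_path, filepath = "demo.ttir.tmp", "demo.ttir"
    with open(temp_path, "w") as f:
        f.write("ir")
    # The race is timing-dependent; it may take a few runs to trigger.
    workers = [mp.Process(target=replace_worker, args=(temp_path, filepath)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()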

Here is my system environment:

    sys.platform: linux
    Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 250149167
    GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
    CUDA_HOME: /mnt/petrelfs/share/cuda-11.7
    NVCC: Cuda compilation tools, release 11.7, V11.7.99
    GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
    PyTorch: 2.1.0+cu121
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.16.0+cu121
    OpenCV: 4.8.1
LyricZhao (Contributor) commented:

+1 with this issue

SolenoidWGT commented:

+1

LyricZhao (Contributor) commented Dec 20, 2023

@HIT-cwh @SolenoidWGT

A fix is to write a customized CacheManager and register it by setting os.environ["TRITON_CACHE_MANAGER"] = '...'. Reference: https://github.com/openai/triton/blob/main/python/triton/runtime/cache.py.

In this manager, only rank 0 actually writes (puts) the file, and all ranks wait at a distributed barrier, e.g.:


import os
import random

from torch.distributed import barrier, get_rank  # or your framework's equivalents
from triton.runtime.cache import FileCacheManager


class ModifiedCacheManager(FileCacheManager):
    def put(self, data, filename, binary=True) -> str:
        if not self.cache_dir:
            raise RuntimeError("Could not create or locate cache dir")
        binary = isinstance(data, bytes)
        if not binary:
            data = str(data)
        assert self.lock_path is not None
        filepath = self._make_path(filename)
        # Random ID to avoid collisions
        rnd_id = random.randint(0, 1000000)
        # Include the PID so we can tell which process created the temp file
        pid = os.getpid()

        # *** Only rank 0 writes the file ***
        if get_rank() == 0:
            # Write to a temporary file first to be robust against interruptions
            temp_path = f"{filepath}.tmp.pid_{pid}_{rnd_id}"
            mode = "wb" if binary else "w"
            with open(temp_path, mode) as f:
                f.write(data)
            # os.replace is atomic on POSIX systems if it succeeds,
            # so filepath can never contain a partial write
            os.replace(temp_path, filepath)
        # *** All ranks wait until rank 0 has written the file ***
        barrier()

        return filepath

In my case it works fine (you must ensure that all ranks follow the same code path, so that every rank reaches the barrier).
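
For completeness, a short sketch of how a manager like this might be registered; the module path below is a placeholder, and I believe TRITON_CACHE_MANAGER is parsed as "module:ClassName", but double-check against your Triton version:

import os

# Assumed layout: ModifiedCacheManager lives in my_project/triton_cache_manager.py.
# Set this before any Triton kernel is compiled (e.g. at the top of the training script).
os.environ["TRITON_CACHE_MANAGER"] = "my_project.triton_cache_manager:ModifiedCacheManager"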

Update (2024-04-07): I think this problem is fixed by #3544, so perhaps you no longer need a customized manager.

C1rN09 commented Apr 7, 2024

@HIT-cwh Hi, I hit the same issue and resolved it by pointing TRITON_CACHE_DIR at local storage instead of shared storage (a minimal sketch is included below).

The root cause in my case is a multi-node cluster with shared storage. Processes on different machines coincidentally get the same PID, and because I set the same random seed everywhere, they also generate the same rnd_id.

https://github.com/openai/triton/blob/28f74253f7e36b2609c998ebbf7ae70acf7ec313/python/triton/runtime/cache.py#L120

As a result, two processes try to os.replace the same temp_path simultaneously, and the slower one crashes because temp_path no longer exists.
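
A minimal sketch of the workaround mentioned above; the path is just an example, and any node-local, non-shared directory works:

import os

# Point the Triton cache at node-local storage so that processes on different
# machines never touch each other's temp files. Set this before any Triton
# kernel is compiled.
os.environ["TRITON_CACHE_DIR"] = "/tmp/triton_cache"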

@LyricZhao Is there any plan to support multi-node clusters with shared storage without a handcrafted CacheManager plugin? I suspect reverting PR #1569 would make it work.

LyricZhao (Contributor) commented Apr 7, 2024


@C1rN09 Hi, I guess PR #3544 will help here.

BTW, I think FileLock is not well supported on some NFS setups, while the atomicity of os.replace is guaranteed on almost all filesystems, so making pid and rnd_id unique across processes should be enough.
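
As an illustration of that last point (not the actual change in #3544), the temporary name could simply be made unique across nodes that share a filesystem:

import os
import socket
import uuid

def unique_temp_path(filepath: str) -> str:
    # Hostname + PID + a random UUID: writers on different nodes of a shared
    # filesystem can no longer pick the same temporary name, so os.replace
    # never races on its source file.
    return f"{filepath}.tmp.{socket.gethostname()}.pid_{os.getpid()}.{uuid.uuid4().hex}"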
