In theory these tests should pass on a single GPU, because one GPU is just a special case. Running them that way also has the practical benefit of letting anyone sanity-check their changes without needing multiple GPUs. So I gave it a quick try and found some interesting failures:
The DistributedMatmulTest.LayoutNT_AllReduce failure looks like an overly tight tolerance threshold. The other failures appear to indicate real bugs. I'm copying the error messages below.
[ RUN ] DistributedMatmulTest.LayoutNT_AllReduce
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first, true) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/tests/cpp/validator.h":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.
Validation error in output 0 on line 235 in file /opt/pytorch/nvfuser/tests/cpp/test_multidevice_matmul.cpp.
Detected abs error of: 1.90735e-05
absolute tolerance was set to 1.30594e-05
and relative tolerance set to 1.30594e-07
Exception raised from testValidate at /opt/pytorch/nvfuser/tests/cpp/validator.h:119 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x27f363 (0x557022b1b363 in bin/test_multidevice)
frame #2: <unknown function> + 0x69a828 (0x557022f36828 in bin/test_multidevice)
frame #3: <unknown function> + 0x69d795 (0x557022f39795 in bin/test_multidevice)
frame #4: <unknown function> + 0x7130d1 (0x557022faf0d1 in bin/test_multidevice)
frame #5: <unknown function> + 0x6fec15 (0x557022f9ac15 in bin/test_multidevice)
frame #6: <unknown function> + 0x6ff3a2 (0x557022f9b3a2 in bin/test_multidevice)
frame #7: <unknown function> + 0x6fff2b (0x557022f9bf2b in bin/test_multidevice)
frame #8: <unknown function> + 0x708584 (0x557022fa4584 in bin/test_multidevice)
frame #9: <unknown function> + 0x6ff585 (0x557022f9b585 in bin/test_multidevice)
frame #10: <unknown function> + 0x100e64 (0x55702299ce64 in bin/test_multidevice)
frame #11: <unknown function> + 0x29d90 (0x7f1bb73d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: __libc_start_main + 0x80 (0x7f1bb73d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #13: <unknown function> + 0x1012a5 (0x55702299d2a5 in bin/test_multidevice)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1715814906 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='DistributedMatmulTest.LayoutNT_AllReduce'
[ FAILED ] DistributedMatmulTest.LayoutNT_AllReduce (556 ms)
This sounds like an overly tight tolerance threshold.
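For reference, the validator's check boils down to at::Tensor::allclose with per-test tolerances, as the assert message above shows. The sketch below (illustrative only, not the validator's actual code; the injected error and tolerances are stand-ins chosen to mirror the numbers reported above) shows why a ~1.9e-05 absolute error trips the current ~1.3e-05 atol and what a slightly looser atol would accept:

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Illustrative stand-ins for the test's reference and fusion outputs.
  at::Tensor aten_output = at::randn({64, 64});
  at::Tensor fusion_output = aten_output + 1.5e-5; // roughly the abs error reported above

  // Tensor::allclose(other, rtol, atol, equal_nan): with the reported
  // rtol (~1.3e-07) and atol (~1.3e-05) the comparison fails, while a
  // modestly larger atol passes.
  std::cout << aten_output.allclose(fusion_output, 1.3e-7, 1.3e-5) << "\n"; // prints 0
  std::cout << aten_output.allclose(fusion_output, 1.3e-7, 2.0e-5) << "\n"; // prints 1
  return 0;
}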
[ RUN ] DistributedMatmulTest.LayoutNT_ReduceScatter
To reproduce: NVFUSER_TEST_RANDOM_SEED=1715814907 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='DistributedMatmulTest.LayoutNT_ReduceScatter'
unknown file: Failure
C++ exception with description "!detect_exception_in_thread_pool.load() INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/kernel_cache.cpp":1335, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Detected exception while compiling fusion segments in parallel. Error messages from all threads are printed below.
Error from segmentation group 0: Splitting an axis of non-Serial parallel type is not supported at this time. Parallelization strategy must be set after calling split.. Tensor: T11_g[ ideviceIdx.x73{( i0 * ( 1 * ( i3 * i6 ) ) )} ] (DeviceMesh{0})
Exception raised from split at /opt/pytorch/nvfuser/csrc/tensor_view.cpp:499 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x5b825b (0x557022e5425b in bin/test_multidevice)
frame #2: <unknown function> + 0x5bb4f2 (0x557022e574f2 in bin/test_multidevice)
frame #3: <unknown function> + 0x552749 (0x557022dee749 in bin/test_multidevice)
frame #4: <unknown function> + 0x553155 (0x557022def155 in bin/test_multidevice)
frame #5: <unknown function> + 0x4411a6 (0x557022cdd1a6 in bin/test_multidevice)
frame #6: <unknown function> + 0x44142c (0x557022cdd42c in bin/test_multidevice)
frame #7: c10::ThreadPool::main_loop(unsigned long) + 0x2bd (0x7f1bf4895c8d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0xdc253 (0x7f1bb76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x94ac3 (0x7f1bb743fac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x126850 (0x7f1bb74d1850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Use NVFUSER_DISABLE=parallel_compile to simplify error message.
Exception raised from compileFusionParallel at /opt/pytorch/nvfuser/csrc/kernel_cache.cpp:1335 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x27f363 (0x557022b1b363 in bin/test_multidevice)
frame #2: <unknown function> + 0x449bda (0x557022ce5bda in bin/test_multidevice)
frame #3: <unknown function> + 0x4548a3 (0x557022cf08a3 in bin/test_multidevice)
frame #4: <unknown function> + 0x499494 (0x557022d35494 in bin/test_multidevice)
frame #5: <unknown function> + 0x499d37 (0x557022d35d37 in bin/test_multidevice)
frame #6: <unknown function> + 0x69e741 (0x557022f3a741 in bin/test_multidevice)
frame #7: <unknown function> + 0x7130d1 (0x557022faf0d1 in bin/test_multidevice)
frame #8: <unknown function> + 0x6fec15 (0x557022f9ac15 in bin/test_multidevice)
frame #9: <unknown function> + 0x6ff3a2 (0x557022f9b3a2 in bin/test_multidevice)
frame #10: <unknown function> + 0x6fff2b (0x557022f9bf2b in bin/test_multidevice)
frame #11: <unknown function> + 0x708584 (0x557022fa4584 in bin/test_multidevice)
frame #12: <unknown function> + 0x6ff585 (0x557022f9b585 in bin/test_multidevice)
frame #13: <unknown function> + 0x100e64 (0x55702299ce64 in bin/test_multidevice)
frame #14: <unknown function> + 0x29d90 (0x7f1bb73d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7f1bb73d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x1012a5 (0x55702299d2a5 in bin/test_multidevice)
" thrown in the test body.
[ FAILED ] DistributedMatmulTest.LayoutNT_ReduceScatter (578 ms)
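The ReduceScatter failure is the "Splitting an axis of non-Serial parallel type" assert, i.e. an axis that was already parallelized (here the DIDx axis of T11) is being split afterwards. For illustration, here is a rough sketch of the ordering the assert asks for, written in the usual test-style TensorView API (include paths, helper names, and the scheduling details are approximate, not the test's actual code):

#include <fusion.h>      // include paths approximate
#include <ops/all_ops.h>

using namespace nvfuser;

void scheduleSketch(int64_t num_devices) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  TensorView* tv0 = makeContigTensor(2); // test-helper for a 2-D float tensor [M, K]
  fusion.addInput(tv0);
  TensorView* tv1 = sum(tv0, {1});
  fusion.addOutput(tv1);

  // Correct order: split the still-Serial axis first...
  tv1->split(0, num_devices, /*inner_split=*/false);
  // ...and only then mark the resulting outer factor as device-parallel.
  tv1->axis(0)->parallelize(ParallelType::DIDx);

  // Splitting tv1->axis(0) again *after* this point would trigger the
  // "Splitting an axis of non-Serial parallel type" assert seen above.
}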
[ RUN ] PipelineTestStagedReduction.StagedReduction/Manual
To reproduce: NVFUSER_TEST_RANDOM_SEED=1715814911 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='PipelineTestStagedReduction.StagedReduction/Manual'
unknown file: Failure
C++ exception with description "false INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/executor_utils.cpp":829, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 3, 3> T0, Tensor<float, 1, 1> T2, Tensor<float, 2, 2> T1, Tensor<int64_t, 1, 1> T5) {
alignas(16) extern __shared__ char array[];
void* shared_mem = array;
NVFUSER_DEFINE_MAGIC_ZERO;
nvfuser_index_t i0;
i0 = (T0.alloc_stride[2LL] * ((nvfuser_index_t)threadIdx.x)) + (T0.alloc_stride[1LL] * ((nvfuser_index_t)blockIdx.x));
nvfuser_index_t i1;
i1 = 32LL * T0.alloc_stride[2LL];
nvfuser_index_t i2;
i2 = -64LL + ((nvfuser_index_t)threadIdx.x);
bool b3;
b3 = (((nvfuser_index_t)blockIdx.x) == 0LL) && (((nvfuser_index_t)threadIdx.x) == 0LL);
// Allocate global tensor T1
*(volatile float*)&T1[((nvfuser_index_t)blockIdx.x)] = 0.000000000e+00f;
float T3[4LL];
#pragma unroll
for(nvfuser_index_t i4 = 0LL; i4 < 4LL; ++i4) {
T3[i4] = 0.000000000e+00f;
}
NVFUSER_UPDATE_MAGIC_ZERO;
if (((((nvfuser_index_t)threadIdx.x) + 96LL) < 64LL)) {
#pragma unroll
for(nvfuser_index_t i4 = 0LL; i4 < 4LL; ++i4) {
T3[i4]
= T3[i4]
+ T0[(i0 + (i1 * (i4 + nvfuser_zero)))];
}
} else {
#pragma unroll
for(nvfuser_index_t i4 = 0LL; i4 < 4LL; ++i4) {
nvfuser_index_t i5;
i5 = i4 + nvfuser_zero;
if ((i2 < (-(32LL * i5)))) {
T3[i4]
= T3[i4]
+ T0[(i0 + (i1 * i5))];
}
}
}
NVFUSER_UPDATE_MAGIC_ZERO;
float T4[1LL];
T4[0LL] = 0.000000000e+00f;
#pragma unroll
for(nvfuser_index_t i6 = 0; i6 < 4LL; ++i6) {
T4[0LL]
= T4[0LL]
+ T3[i6];
}
NVFUSER_UPDATE_MAGIC_ZERO;
blockReduce<true, false, false, true>(*(volatile float*)&T1[((nvfuser_index_t)blockIdx.x)], T4[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, float(0.000000000e+00f));
// Allocate global tensor T5
grid_sync::sync<true, false, false, true, true>(T5[index_utils::maskedOffset<false, true, true>(blockIdx, gridDim)], index_utils::maskedSize<true, false, false>(gridDim));
#pragma unroll
for(nvfuser_index_t i7 = 0; i7 < 8LL; ++i7) {
nvfuser_index_t i8;
i8 = i7 + nvfuser_zero;
if (b3) {
T2[i8]
= *(volatile float*)&T1[i8];
}
}
NVFUSER_UPDATE_MAGIC_ZERO;
}
}
CUDA NVRTC compile error: __tmp_kernel_none_f0_c0_r0_g0.cu(10263): error: no instance of overloaded function "<unnamed>::blockReduce" matches the argument list
argument types are: (volatile float, float, lambda [](float &, float)->void, float *, __nv_bool, float)
blockReduce<true, false, false, true>(*(volatile float*)&T1[((nvfuser_index_t)blockIdx.x)], T4[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, float(0.000000000e+00f));
^
__tmp_kernel_none_f0_c0_r0_g0.cu(4946): note #3327-D: candidate function template "<unnamed>::blockReduce<X_REDUCE,Y_REDUCE,Z_REDUCE,Aligned,T,Func>(T &, const T &, Func, T *, __nv_bool, T)" failed deduction
__device__ void blockReduce(
^
__tmp_kernel_none_f0_c0_r0_g0.cu(4874): note #3322-D: number of parameters of function template "<unnamed>::blockReduce<X_REDUCE,Y_REDUCE,Z_REDUCE,Aligned,T,Func>(T &, const T &, Func, T *, __nv_bool, __nv_bool, T)" does not match the call
__device__ void blockReduce(
^
1 error detected in the compilation of "__tmp_kernel_none_f0_c0_r0_g0.cu".
Exception raised from invoke at /opt/pytorch/nvfuser/csrc/executor_utils.cpp:829 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x27f363 (0x557022b1b363 in bin/test_multidevice)
frame #2: <unknown function> + 0x2b8da0 (0x557022b54da0 in bin/test_multidevice)
frame #3: <unknown function> + 0x297a29 (0x557022b33a29 in bin/test_multidevice)
frame #4: <unknown function> + 0x498d98 (0x557022d34d98 in bin/test_multidevice)
frame #5: <unknown function> + 0x499d37 (0x557022d35d37 in bin/test_multidevice)
frame #6: <unknown function> + 0x68a9d6 (0x557022f269d6 in bin/test_multidevice)
frame #7: <unknown function> + 0x6a67cd (0x557022f427cd in bin/test_multidevice)
frame #8: <unknown function> + 0x7130d1 (0x557022faf0d1 in bin/test_multidevice)
frame #9: <unknown function> + 0x6fec15 (0x557022f9ac15 in bin/test_multidevice)
frame #10: <unknown function> + 0x6ff3a2 (0x557022f9b3a2 in bin/test_multidevice)
frame #11: <unknown function> + 0x6fff2b (0x557022f9bf2b in bin/test_multidevice)
frame #12: <unknown function> + 0x708584 (0x557022fa4584 in bin/test_multidevice)
frame #13: <unknown function> + 0x6ff585 (0x557022f9b585 in bin/test_multidevice)
frame #14: <unknown function> + 0x100e64 (0x55702299ce64 in bin/test_multidevice)
frame #15: <unknown function> + 0x29d90 (0x7f1bb73d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: __libc_start_main + 0x80 (0x7f1bb73d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #17: <unknown function> + 0x1012a5 (0x55702299d2a5 in bin/test_multidevice)
" thrown in the test body.
[ FAILED ] PipelineTestStagedReduction.StagedReduction/Manual, where GetParam() = Manual (27 ms)
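The StagedReduction/Manual failure looks like a plain codegen/runtime-header mismatch: the emitted kernel calls blockReduce with six arguments, while the candidate with the closest arity in the runtime header takes seven (an extra predicate bool), so overload resolution fails. A standalone illustration of that failure mode (generic C++, not NVFuser's actual runtime header or signature):

#include <iostream>

// Stand-in for a reduction helper whose matching-arity overload takes an
// extra bool; a six-argument call like the one in the generated kernel
// fails template deduction exactly the way NVRTC reports above.
template <bool X_REDUCE, typename T, typename Func>
void blockReduceSketch(T& out, const T& in, Func op, T* smem,
                       bool read_pred, bool write_pred, T init) {
  op(out, in);
  (void)smem; (void)read_pred; (void)write_pred; (void)init;
}

int main() {
  float out = 0.f, in = 1.f, smem[32];
  // OK: seven arguments match the template.
  blockReduceSketch<true>(out, in, [](float& a, float b) { a = a + b; },
                          smem, true, true, 0.f);
  // Six arguments, as in the generated kernel, would not compile:
  // blockReduceSketch<true>(out, in, [](float& a, float b) { a = a + b; },
  //                         smem, true, 0.f);
  std::cout << out << "\n";
  return 0;
}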
[ RUN ] MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_0
unknown file: Failure
C++ exception with description "Tried to access out of boundary index 3. total index: 3
Exception raised from wrapDim at /opt/pytorch/nvfuser/csrc/utils.h:560 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x5b7f87 (0x557022e53f87 in bin/test_multidevice)
frame #2: <unknown function> + 0x6c0804 (0x557022f5c804 in bin/test_multidevice)
frame #3: <unknown function> + 0x7130d1 (0x557022faf0d1 in bin/test_multidevice)
frame #4: <unknown function> + 0x6fec15 (0x557022f9ac15 in bin/test_multidevice)
frame #5: <unknown function> + 0x6ff3a2 (0x557022f9b3a2 in bin/test_multidevice)
frame #6: <unknown function> + 0x6fff2b (0x557022f9bf2b in bin/test_multidevice)
frame #7: <unknown function> + 0x708584 (0x557022fa4584 in bin/test_multidevice)
frame #8: <unknown function> + 0x6ff585 (0x557022f9b585 in bin/test_multidevice)
frame #9: <unknown function> + 0x100e64 (0x55702299ce64 in bin/test_multidevice)
frame #10: <unknown function> + 0x29d90 (0x7f1bb73d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #11: __libc_start_main + 0x80 (0x7f1bb73d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x1012a5 (0x55702299d2a5 in bin/test_multidevice)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1715814911 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_0'
[ FAILED ] MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_0, where GetParam() = (true, 0) (0 ms)
[ RUN ] MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_1
To reproduce: NVFUSER_TEST_RANDOM_SEED=1715814911 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_1'
unknown file: Failure
C++ exception with description "Tried to access out of boundary index 3. total index: 3
Exception raised from wrapDim at /opt/pytorch/nvfuser/csrc/utils.h:560 (most recent call first):
frame #0: <unknown function> + 0x8e9af (0x55702292a9af in bin/test_multidevice)
frame #1: <unknown function> + 0x5b7f87 (0x557022e53f87 in bin/test_multidevice)
frame #2: <unknown function> + 0x6c0804 (0x557022f5c804 in bin/test_multidevice)
frame #3: <unknown function> + 0x7130d1 (0x557022faf0d1 in bin/test_multidevice)
frame #4: <unknown function> + 0x6fec15 (0x557022f9ac15 in bin/test_multidevice)
frame #5: <unknown function> + 0x6ff3a2 (0x557022f9b3a2 in bin/test_multidevice)
frame #6: <unknown function> + 0x6fff2b (0x557022f9bf2b in bin/test_multidevice)
frame #7: <unknown function> + 0x708584 (0x557022fa4584 in bin/test_multidevice)
frame #8: <unknown function> + 0x6ff585 (0x557022f9b585 in bin/test_multidevice)
frame #9: <unknown function> + 0x100e64 (0x55702299ce64 in bin/test_multidevice)
frame #10: <unknown function> + 0x29d90 (0x7f1bb73d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #11: __libc_start_main + 0x80 (0x7f1bb73d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x1012a5 (0x55702299d2a5 in bin/test_multidevice)
" thrown in the test body.
[ FAILED ] MultideviceShardingTest.UnshardedGlobalInput/concrete_sharded_along_dim_1, where GetParam() = (true, 1) (0 ms)
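Both MultideviceShardingTest failures die in wrapDim with "index 3. total index: 3", i.e. some code asks for axis 3 of a rank-3 tensor. A tiny stand-in for what a wrapDim-style helper checks (a hypothetical re-implementation for illustration, not the one in csrc/utils.h):

#include <cstdint>
#include <iostream>
#include <stdexcept>

// Negative dims wrap around; anything outside [-ndims, ndims) is rejected,
// which is why asking for index 3 with only 3 dims throws.
int64_t wrapDimSketch(int64_t dim, int64_t ndims) {
  if (dim < 0) {
    dim += ndims;
  }
  if (dim < 0 || dim >= ndims) {
    throw std::out_of_range("Tried to access out of boundary index");
  }
  return dim;
}

int main() {
  std::cout << wrapDimSketch(-1, 3) << "\n"; // prints 2
  wrapDimSketch(3, 3);                       // throws, mirroring the failure above
  return 0;
}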
I'll take a look at PipelineTestStagedReduction.StagedReduction/Manual, where GetParam() = Manual. @samnordmann and @cowanmeg, do you want to look into the other tests? We don't have to fix them right away, but I suspect they indicate real bugs that'll bite us in the future.