Skip to content

MultiGPU malloc_device fails for second GPU (Out of Memory) #16038

@kaanolgu

Description

@kaanolgu

Describe the bug

I am trying to use multiGPU code on AMD MI210 and I can't allocate anything on second GPU. Even N=1 sized allocation fails with HIP Out of Memory error. I tried these prebuilt nightly builds:

llvm-nighltly-2024-10-26
llvm-nighltly-2024-11-09
llvm-nighltly-2024-11-07

The code works if I use a single GPU and only allocate memory on single GPU

To reproduce

#include <sycl/sycl.hpp>
#include <iostream>

#define NUM_GPU 2
int main() {

  auto Devs = sycl::device::get_devices(sycl::info::device_type::gpu);


  std::vector<sycl::queue> Queues;
  // Insert not all devices only the required ones for model


  std::transform(Devs.begin(), Devs.begin() + NUM_GPU, std::back_inserter(Queues),
                 [](const sycl::device &D) { return sycl::queue{D}; });

  ////////////////////////////////////////////////////////////////////////
  if (Devs.size() > 1){
  if (!Devs[0].ext_oneapi_can_access_peer(
          Devs[1], sycl::ext::oneapi::peer_access::access_supported)) {
    std::cout << "P2P access is not supported by devices, exiting."
              << std::endl;

  }
  }
    std::cout <<"\n----------------------------------------"<< std::endl;

    std::cout << "Running on devices:" << std::endl;
    for(int i =0; i < Queues.size(); i++){

    std::cout << i << ":\t" << Queues[i].get_device().get_info<sycl::info::device::name>()
              << std::endl;
    }
std::cout <<"----------------------------------------"<< std::endl;


const int N = 1;
std::vector<uint32_t> data_host(N);
std::vector<uint32_t> data_host_2(N);

uint32_t* data_dev     = malloc_device<uint32_t>(N, Queues[0]); 
uint32_t* data_dev_2     = malloc_device<uint32_t>(N, Queues[1]); 

Queues[0].memcpy(data_dev, data_host.data(), N * sizeof(uint32_t));
Queues[1].memcpy(data_dev_2, data_host_2.data(), N * sizeof(uint32_t));
}

This is the compile command:

export PATH=$DPCPP_HOME/llvm-2024-10-26/bin:$PATH;
export LD_LIBRARY_PATH=$DPCPP_HOME/llvm-2024-10-26/lib:$LD_LIBRARY_PATH
sycl-ls
rm -rf a.out;clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a atomic_test.cpp ; ./a.out 

This is the output :

$ ./build.sh
[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:1] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:2] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]

----------------------------------------
Running on devices:
0:      AMD Instinct MI210
1:      AMD Instinct MI210
----------------------------------------
<HIP>[ERROR]: 
UR HIP ERROR:
        Value:           2
        Name:            hipErrorOutOfMemory
        Description:     out of memory
        Function:        getNextTransferStream
        Source Location: /__w/llvm/llvm/build/_deps/unified-runtime-src/source/adapters/hip/queue.cpp:106

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 38 (UR_RESULT_ERROR_OUT_OF_HOST_MEMORY)
./build.sh: line 16: 542041 Aborted                 (core dumped) ./a.out

Environment

Device  [Model : Revision]    Temp    Power  Partitions      SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                   
==============================================================================================================
0       [0x0c34 : 0x02]       39.0°C  41.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        
1       [0x0c34 : 0x02]       39.0°C  43.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        
2       [0x0c34 : 0x02]       34.0°C  42.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        

sycl-ls:

[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:1] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:2] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]

Platforms: 1
Platform [#1]:
    Version  : HIP 60140.9
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 3
        Device [#0]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 4950100981019857102575252102995310254
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
        Device [#1]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 5548569810097102505099994910097100100
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
        Device [#2]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 1014852485797999898535199551015355
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
default_selector()      : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
accelerator_selector()  : No device of requested type available.
cpu_selector()          : No device of requested type available.
gpu_selector()          : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
custom_selector(gpu)    : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
custom_selector(cpu)    : No device of requested type available.
custom_selector(acc)    : No device of requested type available.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghipIssues related to execution on HIP backend.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions