[OMPT][Trunk] Offloading to x86_64 misses some OMPT target callbacks #64487

Closed
Thyre opened this issue Aug 7, 2023 · 15 comments

@Thyre

Thyre commented Aug 7, 2023

Description

LLVM Trunk has added first support for OMPT callbacks for target directives. During testing, I noticed that callbacks for offloading to host are dispatched as well. While that's fine, the callbacks ompt_callback_device_initialize, ompt_callback_device_load, ompt_callback_device_unload and ompt_callback_device_finalize are not dispatched.

The callback ompt_callback_device_unload is not implemented as far as I know, but I would expect the others to show up.

The missing ompt_callback_device_initialize violates the OpenMP specification, which states [Link]:

The OpenMP implementation invokes this callback after OpenMP is initialized for the device but before execution of any OpenMP construct is started on the device.
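A tool could check this ordering requirement itself. The following is a minimal sketch, not code from the reproducer: `record_device_initialize`, `target_event_is_ordered`, and `MAX_TRACKED_DEVICES` are hypothetical names, and the sketch assumes that the tool calls the first function from its device-initialize callback and the second from its target callback, with the `device_num` each callback receives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical upper bound on the device numbers this sketch tracks. */
#define MAX_TRACKED_DEVICES 64

/* One flag per device number; static storage starts out all false. */
static bool device_initialized[MAX_TRACKED_DEVICES];

/* To be called from the tool's device-initialize callback. */
void record_device_initialize(int device_num)
{
    assert(device_num >= 0 && device_num < MAX_TRACKED_DEVICES);
    device_initialized[device_num] = true;
}

/* To be called from the tool's target callback. Returns true only if an
 * initialize event was recorded earlier for this device number. */
bool target_event_is_ordered(int device_num)
{
    if (device_num < 0 || device_num >= MAX_TRACKED_DEVICES)
        return false;
    return device_initialized[device_num];
}
```

With such a check in place, a target event on a device that never reported initialization would be flagged instead of being silently accepted.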

NVHPC 23.7 also dispatches ompt_callback_target without ompt_callback_device_initialize. However, in their case one can identify host execution by comparing device_num with the value returned by ompt_get_num_devices: for host execution, device_num is either negative or above that value. In the case of LLVM, offloading to host seems to initialize four devices, which are handled just like GPU devices. Therefore, we get normal device numbers. We can verify this by running llvm-omp-device-info:

$ llvm-omp-device-info
Device (0):
    Device Type    Generic-elf-64bit

Device (1):
    Device Type    Generic-elf-64bit

Device (2):
    Device Type    Generic-elf-64bit

Device (3):
    Device Type    Generic-elf-64bit

Device (4):
    CUDA Driver Version              12020
    CUDA OpenMP Device Number        0
    Device Name                      NVIDIA GeForce MX550
    Global Memory Size               94779004878848 bytes
    Number of Multiprocessors        16
    Concurrent Copy and Execution    Yes
    Total Constant Memory            65536 bytes
    Max Shared Memory per Block      49152 bytes
    Registers per Block              65536
    Warp Size                        32
    Maximum Threads per Block        1024
    Maximum Block Dimensions         
        x                            1024
        y                            1024
        z                            64
    Maximum Grid Dimensions          
        x                            2147483647
        y                            65535
        z                            65535
    Maximum Memory Pitch             2147483647 bytes
    Texture Alignment                512 bytes
    Clock Rate                       1320000 kHz
    Execution Timeout                No
    Integrated Device                No
    Can Map Host Memory              Yes
    Compute Mode                     Default
    Concurrent Kernels               Yes
    ECC Enabled                      No
    Memory Clock Rate                6001000 kHz
    Memory Bus Width                 64 bits
    L2 Cache Size                    524288 bytes
    Max Threads Per SMP              1024
    Async Engines                    3
    Unified Addressing               Yes
    Managed Memory                   Yes
    Concurrent Managed Memory        Yes
    Preemption Supported             Yes
    Cooperative Launch               Yes
    Multi-Device Boars               No
    Compute Capabilities             sm_75
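The host check described above for NVHPC can be sketched as a small C helper. This is an illustration only: `is_host_device_num` is a hypothetical name, `num_devices` stands for the value a tool would obtain from ompt_get_num_devices, and treating a device_num equal to that value as the host is an assumption based on the OpenMP convention that the initial device's number equals the number of offload devices.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a device_num reported by a target
 * callback lies outside the range of offload devices [0, num_devices).
 * Assumption: the host is reported either with a negative number or with a
 * number that is not a valid offload device index. */
bool is_host_device_num(int device_num, int num_devices)
{
    assert(num_devices >= 0);
    return device_num < 0 || device_num >= num_devices;
}
```

With the five devices listed above, a call such as `is_host_device_num(0, 5)` returns false even when device 0 is one of the Generic-elf-64bit devices, which is why this check cannot detect host offloading under LLVM.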

Reproducer

I used one of the aomp smoke tests, veccopy-ompt-target-emi, to reproduce this issue.

When compiling and running the code with the offload target nvptx64, the following output can be seen:

$ clang --version
clang version 18.0.0 (https://github.com/llvm/llvm-project.git 52ac71f92d38f75df5cb88e9c090ac5fd5a71548)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/software/software/LLVM/git/bin
$ clang -fopenmp -fopenmp-targets=nvptx64 test.c   
$ ./a.out
Callback Init: device_num=0 type=sm_75 device=0x5593d65d44c0 lookup=0x7fef929e0480 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x5593d5e8e758 device_addr:(nil) bytes:758224
Callback Target EMI: kind=1 endpoint=1 device_num=0 task_data=0x5593d6591540 (0x0) target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) code=0x5593d5e8c9e2
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000002) src=0x7fff98ef29d0 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000002) src=0x7fff98ef29d0 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000003) src=0x7fff98ef29d0 src_device_num=1 dest=0x7fef60600000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000003) src=0x7fff98ef29d0 src_device_num=1 dest=0x7fef60600000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000004) src=0x7fff98ef1a30 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000004) src=0x7fff98ef1a30 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000005) src=0x7fff98ef1a30 src_device_num=1 dest=0x7fef60601000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000005) src=0x7fff98ef1a30 src_device_num=1 dest=0x7fef60601000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback Submit EMI: endpoint=1  req_num_teams=1 target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7a0 (0x0)
  Callback Submit EMI: endpoint=2  req_num_teams=1 target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7a0 (0x0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000006) src=0x7fef60601000 src_device_num=0 dest=0x7fff98ef1a30 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000006) src=0x7fef60601000 src_device_num=0 dest=0x7fff98ef1a30 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000007) src=0x7fef60600000 src_device_num=0 dest=0x7fff98ef29d0 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000007) src=0x7fef60600000 src_device_num=0 dest=0x7fff98ef29d0 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000008) src=0x7fef60601000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000008) src=0x7fef60601000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000009) src=0x7fef60600000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) host_op_id=0x7fef9282a7c0 (0x8000000000000009) src=0x7fef60600000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
Callback Target EMI: kind=1 endpoint=2 device_num=0 task_data=0x5593d6591540 (0x0) target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x8000000000000001) code=0x5593d5e8c9e2
Callback Target EMI: kind=1 endpoint=1 device_num=0 task_data=0x5593d6591540 (0x0) target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) code=0x5593d5e8cbfe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000b) src=0x7fff98ef29d0 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000b) src=0x7fff98ef29d0 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000c) src=0x7fff98ef29d0 src_device_num=1 dest=0x7fef60601000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000c) src=0x7fff98ef29d0 src_device_num=1 dest=0x7fef60601000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000d) src=0x7fff98ef1a30 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000d) src=0x7fff98ef1a30 src_device_num=1 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fef928eb383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000e) src=0x7fff98ef1a30 src_device_num=1 dest=0x7fef60600000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000e) src=0x7fff98ef1a30 src_device_num=1 dest=0x7fef60600000 dest_device_num=0 bytes=4000 code=0x7fef928eb2fe
  Callback Submit EMI: endpoint=1  req_num_teams=0 target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7a0 (0x0)
  Callback Submit EMI: endpoint=2  req_num_teams=0 target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7a0 (0x0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000f) src=0x7fef60600000 src_device_num=0 dest=0x7fff98ef1a30 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x800000000000000f) src=0x7fef60600000 src_device_num=0 dest=0x7fff98ef1a30 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000010) src=0x7fef60601000 src_device_num=0 dest=0x7fff98ef29d0 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000010) src=0x7fef60601000 src_device_num=0 dest=0x7fff98ef29d0 dest_device_num=1 bytes=4000 code=0x7fef928f457f
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000011) src=0x7fef60600000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000011) src=0x7fef60600000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000012) src=0x7fef60601000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) host_op_id=0x7fef9282a7c0 (0x8000000000000012) src=0x7fef60601000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fef928ec73a
Callback Target EMI: kind=1 endpoint=2 device_num=0 task_data=0x5593d6591540 (0x0) target_task_data=0x5593d65ba818 (0x0) target_data=0x7fef9282a7a8 (0x800000000000000a) code=0x5593d5e8cbfe
Success
Callback Fini: device_num=0

Replacing nvptx64 with x86_64, one can see the following:

$ clang --version
clang version 18.0.0 (https://github.com/llvm/llvm-project.git 52ac71f92d38f75df5cb88e9c090ac5fd5a71548)
Target: x86_64-unknown-linux-gnu
Thread model: posix
$ clang -fopenmp -fopenmp-targets=x86_64 test.c   
$ ./a.out
Callback Target EMI: kind=1 endpoint=1 device_num=0 task_data=0x55a83389d540 (0x0) target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) code=0x55a8336389e2
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000002) src=0x7fffc2fe7820 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000002) src=0x7fffc2fe7820 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000003) src=0x7fffc2fe7820 src_device_num=4 dest=0x55a8338f82d0 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000003) src=0x7fffc2fe7820 src_device_num=4 dest=0x55a8338f82d0 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000004) src=0x7fffc2fe6880 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000004) src=0x7fffc2fe6880 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000005) src=0x7fffc2fe6880 src_device_num=4 dest=0x55a833902680 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000005) src=0x7fffc2fe6880 src_device_num=4 dest=0x55a833902680 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback Submit EMI: endpoint=1  req_num_teams=1 target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287a0 (0x0)
  Callback Submit EMI: endpoint=2  req_num_teams=1 target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287a0 (0x0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000006) src=0x55a833902680 src_device_num=0 dest=0x7fffc2fe6880 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000006) src=0x55a833902680 src_device_num=0 dest=0x7fffc2fe6880 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000007) src=0x55a8338f82d0 src_device_num=0 dest=0x7fffc2fe7820 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000007) src=0x55a8338f82d0 src_device_num=0 dest=0x7fffc2fe7820 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000008) src=0x55a833902680 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000008) src=0x55a833902680 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000009) src=0x55a8338f82d0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) host_op_id=0x7fa2632287c0 (0x8000000000000009) src=0x55a8338f82d0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
Callback Target EMI: kind=1 endpoint=2 device_num=0 task_data=0x55a83389d540 (0x0) target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x8000000000000001) code=0x55a8336389e2
Callback Target EMI: kind=1 endpoint=1 device_num=0 task_data=0x55a83389d540 (0x0) target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) code=0x55a833638bfe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000b) src=0x7fffc2fe7820 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000b) src=0x7fffc2fe7820 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000c) src=0x7fffc2fe7820 src_device_num=4 dest=0x55a833902680 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000c) src=0x7fffc2fe7820 src_device_num=4 dest=0x55a833902680 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000d) src=0x7fffc2fe6880 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000d) src=0x7fffc2fe6880 src_device_num=4 dest=(nil) dest_device_num=0 bytes=4000 code=0x7fa26336f383
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000e) src=0x7fffc2fe6880 src_device_num=4 dest=0x55a8338f82d0 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000e) src=0x7fffc2fe6880 src_device_num=4 dest=0x55a8338f82d0 dest_device_num=0 bytes=4000 code=0x7fa26336f2fe
  Callback Submit EMI: endpoint=1  req_num_teams=0 target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287a0 (0x0)
  Callback Submit EMI: endpoint=2  req_num_teams=0 target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287a0 (0x0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000f) src=0x55a8338f82d0 src_device_num=0 dest=0x7fffc2fe6880 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x800000000000000f) src=0x55a8338f82d0 src_device_num=0 dest=0x7fffc2fe6880 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000010) src=0x55a833902680 src_device_num=0 dest=0x7fffc2fe7820 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000010) src=0x55a833902680 src_device_num=0 dest=0x7fffc2fe7820 dest_device_num=4 bytes=4000 code=0x7fa26337857f
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000011) src=0x55a8338f82d0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000011) src=0x55a8338f82d0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000012) src=0x55a833902680 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) host_op_id=0x7fa2632287c0 (0x8000000000000012) src=0x55a833902680 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x7fa26337073a
Callback Target EMI: kind=1 endpoint=2 device_num=0 task_data=0x55a83389d540 (0x0) target_task_data=0x55a8338c6818 (0x0) target_data=0x7fa2632287a8 (0x800000000000000a) code=0x55a833638bfe
Success

Notice that the Init, Load and Fini callbacks are missing from this output.

Side question:

I noticed that omp_get_num_devices() returns four for offloading to x86_64. What's the reasoning behind that? llvm-omp-device-info also shows four devices with the type Generic-elf-64bit on a system with Ubuntu 22.04 LTS, Intel Core i7-1260P.

@Thyre
Author

Thyre commented Aug 7, 2023

Just as an addition: Leaving out -fopenmp-targets=x86_64 produces no output by the OMPT interface in the example tool. My guess is that the generated code is host only, skipping the target directive and converting it to a parallel for. Therefore, we do not get the callbacks during execution.

@llvmbot
Collaborator

llvmbot commented Aug 7, 2023

@llvm/issue-subscribers-openmp

@jhuber6
Contributor

jhuber6 commented Aug 8, 2023

@mhalk Any suggestions?

@mhalk
Contributor

mhalk commented Aug 8, 2023

Just took a quick look, I guess we have to add OMPT support for the generic-elf-64bit, so the corresponding devices will trigger Init, Load and Fini.
If these OMPT callbacks really need to be triggered, then the missing call to ompt::connectLibrary(); within GenELF64PluginTy::initImpl() is already an explanation.

But I'll have to check back with the spec & @dhruvachak how to go about this.

@mhalk
Contributor

mhalk commented Aug 10, 2023

Poked at this, and it works ...
To a certain degree -- that is, if you're not maliciously trying to force it to produce ambiguous output.

From my perspective there are two ways to deal with the situation, either we try to omit all the callbacks generated in libomptarget, or we add the support for generic-elf-64bit.
IMHO the latter is less painful & presumably even less error-prone.

Let's say we would go with "added support":
I added prints for omp_get_device_num and omp_get_num_devices.
Also forced the target regions to use device number 3 (and 11, for the two-target-version).

Output for x86_64 only:

trunk/bin/clang -fopenmp -fopenmp-targets=x86_64 veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=4
omp_get_num_devices=4

Callback Init: device_num=3 type=unknown device=0x1485000 lookup=0x7f0bd39c8a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200368 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3

Output for x86_64 and amdgcn-amd-amdhsa (8 gfx90a GPUs present):

trunk/bin/clang -fopenmp -fopenmp-targets=x86_64,amdgcn-amd-amdhsa veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=12
omp_get_num_devices=12

Callback Init: device_num=3 type=gfx90a device=0x12ae980 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200378 device_addr:(nil) bytes:19208
[...]
Callback Init: device_num=3 type=unknown device=0x12b52a0 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x204f00 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3
Callback Fini: device_num=3

So, in the sources, I intentionally used devices 3 and 11 for the target regions.
Both correlate to device number 3 in their respective RTL.
This illustrates that the output can be ambiguous, e.g. w.r.t. the Fini callbacks.
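The numbering seen in this output can be reproduced with a small sketch. It assumes that each plugin's devices occupy a consecutive block of global device numbers, here 8 amdgcn devices followed by 4 x86_64 devices, which is inferred from the output above and is not taken from libomptarget's sources. `struct local_device` and `split_device_num` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Result of splitting a global OpenMP device number. */
struct local_device {
    int plugin_index; /* index into the per-plugin counts, -1 if not found */
    int local_id;     /* device number inside that plugin, -1 if not found */
};

/* Hypothetical helper: walks the per-plugin device counts in order and
 * subtracts each count until the remaining number falls inside one
 * plugin's range. */
struct local_device split_device_num(const int *counts, size_t num_plugins,
                                     int global_num)
{
    struct local_device result = { -1, -1 };
    assert(counts != NULL);
    if (global_num < 0)
        return result;
    for (size_t i = 0; i < num_plugins; ++i) {
        if (global_num < counts[i]) {
            result.plugin_index = (int)i;
            result.local_id = global_num;
            return result;
        }
        global_num -= counts[i];
    }
    return result; /* global_num lies beyond the last device */
}
```

With counts of 8 and 4, global devices 3 and 11 both end up with local number 3, in plugin 0 and plugin 1 respectively, which matches the two `device_num=3` Init callbacks shown above.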

Other than that, the generic CPU will always report type=unknown within the Init callback since this is the default implementation of getComputeUnitKind.
Now, TBH I have to ask around why generic-elf-64bit is (hardcoded) reporting four devices; maybe there is a reason behind this.

@Thyre Please, let me know what you think based on these outputs.
I would be interested to know what you expect(ed) to happen or would like to see.

@Thyre
Author

Thyre commented Aug 11, 2023

Thanks a lot for the update 😄
It's great to see that getting the missing callbacks for generic-elf-64bit seems to be possible.

I would agree that the second option, adding support for generic-elf-64bit, would be better. Your points are very good arguments for this approach. In addition, omitting all target callbacks would potentially leave out interesting information for tools using the interface.

Judging by your output for both x86_64 and a combination of x86_64 + amdgcn, I would expect that nvptx64 + amdgcn yields the same results shown in your second example. I think that I can test this. For tool developers, the ambiguous output can be hard to work with, since we can only use the device_num to identify the executing device during target callbacks. The runtime should probably dispatch the same device number users can use to define the executing device (here 3 for the AMD GPU and 11 for the host).

Getting unknown as the device type doesn't look pretty, but I'm fine with that. Some people might get confused though. Skimming through the code a bit, we may be able to add getComputeUnitKind to generic-elf-64bit as well, right? Then, we could report it as something like generic.

@Thyre
Author

Thyre commented Aug 11, 2023

Judging by your output for both x86_64 and a combination of x86_64 + amdgcn, I would expect that nvptx64 + amdgcn yields the same results shown in your second example. I think that I can test this. For tool developers, the ambiguous output can be hard to work with, since we can only use the device_num to identify the executing device during target callbacks. The runtime should probably dispatch the same device number users can use to define the executing device (here 3 for the AMD GPU and 11 for the host).

I checked how the device numbers are delivered on a system with an AMD + NVIDIA GPU. We can see the same behavior:

$ clang-18 -fopenmp -fopenmp-targets=nvptx64,amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a veccopy.c
$ ./a.out
Num devices = 3
Device 0
Callback Init: device_num=0 type=sm_80 device=0x556d3f652a90 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3ed12778 device_addr:(nil) bytes:715888
[...]
Device 1
Callback Init: device_num=0 type=gfx90a device=0x556d402203d0 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Device 2
Callback Init: device_num=1 type=gfx90a device=0x556d40229210 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:1 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Success
Callback Fini: device_num=0
Callback Fini: device_num=0
Callback Fini: device_num=1

@mhalk
Contributor

mhalk commented Aug 11, 2023

I checked how the device numbers are delivered on a system with an AMD + NVIDIA GPU. We can see the same behavior [...]

Thanks for checking this! So, I guess this behavior is okay for now? (See my next comments.)

The runtime should probably dispatch the same device number users can use to define the executing device [...]

tl;dr: Agreed. Like that idea, albeit I'd have to see if this is possible. Maybe I'm not aware of some information but ATM I'm not very confident this can be (reasonably) solved on my end.

When the callbacks are dispatched, we only have the information from the corresponding RTL.
To adhere to our example, that is e.g. DeviceId=3 -- while in the OpenMP runtime this might in fact be device number 11.
Now the callback is executed, that's it -- for the init callback we could (during execution of the corresponding callback!) find out the "actual" (whatever that means) device number since we have the kind (amdgcn, nvptx, ...) and might be able to deduce further info. But other callbacks do not have this information.
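
To make the missing translation concrete, here is a minimal sketch of how a plugin-local id could be turned into a global device number. It assumes the runtime numbers devices plugin by plugin in a fixed order, which matches the logs in this thread but is only an assumption about the internals. `plugin_info` and `global_device_num` are hypothetical names, not libomptarget code.

```c
#include <stddef.h>

/* Hypothetical description of one offload plugin (RTL).
   Not a libomptarget type. */
typedef struct {
    const char *kind;   /* e.g. "nvptx" or "amdgcn" */
    int num_devices;    /* devices this plugin exposes */
} plugin_info;

/* Translate a plugin-local device id into a global device number.
   ASSUMPTION: the runtime numbers devices plugin by plugin, in the
   order the plugins appear in the array. Returns -1 for
   out-of-range input. */
int global_device_num(const plugin_info *plugins, size_t num_plugins,
                      size_t plugin_index, int local_id) {
    if (plugin_index >= num_plugins)
        return -1;
    if (local_id < 0 || local_id >= plugins[plugin_index].num_devices)
        return -1;
    int offset = 0;
    for (size_t i = 0; i < plugin_index; ++i)
        offset += plugins[i].num_devices;  /* skip earlier plugins */
    return offset + local_id;
}
```

With the counts from the nvptx64 + amdgcn log in this thread (one sm_80 device, two gfx90a devices), plugin 1 with local id 1 maps to global device 2, which matches the "Device 2" entry there. The catch described above remains: the plugin itself does not know the offset, so this lookup would have to live somewhere that sees all plugins.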

Getting unknown as the device type doesn't look pretty, but I'm fine with that. Some people might get confused though.

As you stated this can be alleviated by implementing getComputeUnitKind for generic-elf-64bit in a single line.

Generally, this is (at this stage) a very small change w.r.t. the LoC, but there definitely needs to be some discussion beforehand.
The OpenMP meeting is on Wednesdays, we'll bring this up -- i.e. how do we go about this.

@Thyre
Author

Thyre commented Aug 11, 2023

Thanks for checking this! So, I guess this behavior is okay for now? (See my next comments.)

I would say that the behavior is okay for now. We should probably track this in a separate issue since it is not directly related to the offloading to host.

tl;dr: Agreed. Like that idea, albeit I'd have to see if this is possible. Maybe I'm not aware of some information but ATM I'm not very confident this can be (reasonably) solved on my end.

When the callbacks are dispatched, we only have the information from the corresponding RTL. To adhere to our example, that is e.g. DeviceId=3 -- while in the OpenMP runtime this might in fact be device number 11. Now the callback is executed, that's it -- for the init callback we could (during execution of the corresponding callback!) find out the "actual" (whatever that means) device number since we have the kind (amdgcn, nvptx, ...) and might be able to deduce further info. But other callbacks do not have this information.

I expected something like this. For now, I would say that it is not a huge deal for most users since offloading to two different architectures isn't that common and target if(false) will not trigger target directives on the host. However, it would certainly be great to have support for it at some point in the future.

As you stated this can be alleviated by implementing getComputeUnitKind for generic-elf-64bit in a single line.

Generally, this is (at this stage) a very small change w.r.t. the LoC, but there definitely needs to be some discussion beforehand. The OpenMP meeting is on Wednesdays, we'll bring this up -- i.e. how do we go about this.

Sure! I guess assigning a different name is also not necessarily in the scope of this issue. This can be done separately 🙂

@mhalk self-assigned this Aug 15, 2023
@mhalk
Contributor

mhalk commented Aug 16, 2023

@Thyre FYI Just brought this up in the OpenMP meeting and I'll prepare two patches for / related to this issue.
Presumably, this should be further discussed in the next meeting since this involves changes to the generic-elf-64bit plugin.

@Thyre
Author

Thyre commented Aug 16, 2023

@Thyre FYI Just brought this up in the OpenMP meeting and I'll prepare two patches for / related to this issue. Presumably, this should be further discussed in the next meeting since this involves changes to the generic-elf-64bit plugin.

That's great to hear, thanks 😄
I can understand that there needs to be further discussion. It shouldn't affect my development too much right now because we need to get OpenMP teams running in Score-P first before I can really test offloading to x86_64 with LLVM.

@mhalk
Contributor

mhalk commented Aug 22, 2023

FYI The two patches are prepped for tomorrow:
https://reviews.llvm.org/D158542
https://reviews.llvm.org/D158543

@mhalk
Contributor

mhalk commented Aug 23, 2023

As there were no objections during the meeting and the patches have already been accepted, I will polish them a tiny bit and land them. Additionally, I want to check that we do not link OMPT into every affected plugin all the time, even when OMPT support is disabled.
If nothing unexpected happens this should be done this week.

@mhalk
Contributor

mhalk commented Aug 25, 2023

First tiny patch has landed -- so the plugin reports a reasonable CU kind / device type=generic-64bit (e.g. during init).
For the second patch I want to get feedback / confirmation / another pair of eyes.
Currently OMPT is linked, even when its support was set to OFF.
The translation units should be empty in that case, so no big deal -- that'll be fixed: no OMPT traces in the shared object when disabled.

@mhalk closed this as completed in 9300b6d Aug 25, 2023
@Thyre
Author

Thyre commented Aug 25, 2023

Thanks a lot for fixing the issue this quickly. It will certainly help when continuing to implement support for both the teams directive and offloading in Score-P.
