Memory Transfer on A770 16 GB Fails Unit Tests and has Incomplete Level Zero/OpenCL API #618

Closed
BA8F0D39 opened this issue Feb 19, 2023 · 14 comments

@BA8F0D39

I have an A770 16 GB, and I installed intel-compute-runtime 22.43.24595.30 and Intel Extension for PyTorch v1.13.10+xpu on Linux kernel 6.2-rc8.

I wrote a script to measure GPU-to-GPU (device-local) memory copy bandwidth on the A770.

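import torch
import torchvision

# Importing IPEX registers the 'xpu' device with PyTorch.
import intel_extension_for_pytorch as ipex

import time

# Edge lengths of the square bfloat16 matrices to copy.
# The repeated 1000 entries give the runtime a few warm-up iterations.
sizes = [1000, 1000, 1000, 1000, 1000, 2000, 3000, 10000, 20000, 30000, 40000, 42000, 43000, 44000, 45000]

for size in sizes:

    array0 = torch.rand(size, size, dtype=torch.bfloat16, device='xpu')
    array1 = torch.rand(size, size, dtype=torch.bfloat16, device='xpu')

    # Wait for pending GPU work so that only the copy is timed.
    torch.xpu.synchronize()
    start = time.time()
    # torch.clone allocates a new tensor and copies into it,
    # so the allocation is part of the timed region.
    array0 = torch.clone(array1)
    torch.xpu.synchronize()
    end = time.time()
    # bfloat16 is 16 bits per element; sizes are kept in bits and divided
    # by 8E9 (bits per GB) when printed. Each byte is counted once, not
    # once for the read and once for the write.
    transferrate = (size*size*16)/(end - start)
    datasize = (size*size*16)
    print("==========")
    print("Transfering " + str(datasize/8E9) + " GB")
    print("Bandwidth " + str(transferrate/8E9) + " GB/s")
    print("==========")

    # Release cached allocations before the next, larger size.
    torch.xpu.empty_cache()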

Running the script gives a maximum GPU-to-GPU transfer rate of around 100 GB/s:

==========
Transfering 0.002 GB
Bandwidth 29.746836879432625 GB/s
==========
==========
Transfering 0.008 GB
Bandwidth 50.45779248120301 GB/s
==========
==========
Transfering 0.018 GB
Bandwidth 71.29128611898017 GB/s
==========
==========
Transfering 0.2 GB
Bandwidth 80.71401905128451 GB/s
==========
==========
Transfering 0.8 GB
Bandwidth 91.80419151846785 GB/s
==========
==========
Transfering 1.8 GB
Bandwidth 102.16166711772665 GB/s
==========
==========
Transfering 3.2 GB
Bandwidth 103.13888713854288 GB/s
==========
==========
Transfering 3.528 GB
Bandwidth 102.9577837521917 GB/s
==========

Why is the GPU to GPU transfer rate limited to 100 GB/s, when the A770 16 GB's bandwidth is 512 GB/s?
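
One detail worth checking before comparing against the peak figure: the script counts each byte once, while a device-local copy both reads the source and writes the destination. The following is a minimal sketch of that arithmetic, not a measurement. The helper names `copied_gb` and `bandwidth_gb_s` are made up for illustration and are not part of PyTorch or IPEX, and the elapsed time is a hypothetical value chosen to reproduce the roughly 103 GB/s reported for the 3.2 GB copy.

```python
BITS_PER_BF16 = 16  # bfloat16 elements are 16 bits wide


def copied_gb(size):
    """Size in GB (1e9 bytes) of one size x size bfloat16 matrix."""
    return size * size * BITS_PER_BF16 / 8e9


def bandwidth_gb_s(size, seconds, count_read_and_write=False):
    """Copy bandwidth in GB/s.

    With count_read_and_write=True the traffic is doubled, because a copy
    reads every source byte and writes every destination byte.
    """
    traffic_gb = copied_gb(size) * (2 if count_read_and_write else 1)
    return traffic_gb / seconds


size = 40000
# Hypothetical elapsed time (assumption): picked so that the one-way figure
# comes out near the 103 GB/s shown for the 3.2 GB copy.
seconds = copied_gb(size) / 103.0

one_way = bandwidth_gb_s(size, seconds)
read_plus_write = bandwidth_gb_s(size, seconds, count_read_and_write=True)
print(copied_gb(size), one_way, read_plus_write)
```

Counted this way, about 103 GB/s of copy bandwidth corresponds to roughly 206 GB/s of memory traffic, which would still be well below the quoted peak.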

@BartusW
Contributor

BartusW commented Feb 20, 2023

Thank you for your feedback. These performance numbers were collected through the high-level PyTorch software stack.
Please try removing the high-level abstraction layers and measure with the native OpenCL or Level Zero API on your system, then report the scores.
You can use:

@BA8F0D39
Author

BA8F0D39 commented Feb 20, 2023

@BartusW

clpeak has weird results: enqueueWriteBuffer is more than 2x faster than enqueueReadBuffer. Why is there a difference between reads and writes? Every other GPU has approximately the same read and write speed.

Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Graphics [0x56a0]
    Driver version  : 22.43.30 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Global memory bandwidth (GBPS)
      float   : 397.87
      float2  : 403.63
      float4  : 407.18
      float8  : 416.18
      float16 : 421.80

    Single-precision compute (GFLOPS)
      float   : 13017.51
      float2  : 11136.49
      float4  : 10402.49
      float8  : 10026.09
      float16 : 9695.57

    Half-precision compute (GFLOPS)
      half   : 19543.72
      half2  : 19489.39
      half4  : 19523.66
      half8  : 19454.95
      half16 : 19336.14

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 4380.31
      int2  : 4385.50
      int4  : 4403.38
      int8  : 4273.37
      int16 : 5004.16

    Integer compute Fast 24bit (GIOPS)
      int   : 4361.75
      int2  : 4369.68
      int4  : 4387.98
      int8  : 4265.73
      int16 : 4995.43

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 21.64
      enqueueReadBuffer               : 8.92
      enqueueWriteBuffer non-blocking : 22.81
      enqueueReadBuffer non-blocking  : 9.10
      enqueueMapBuffer(for read)      : 20.58
        memcpy from mapped ptr        : 22.62
      enqueueUnmap(after write)       : 23.62
        memcpy to mapped ptr          : 22.44

    Kernel launch latency : 34.76 us

Also, the kernel launch latency of 34.76 us on the A770 16 GB is 10x higher than the 3.46 us of the RTX 2080 SUPER:
https://github.com/krrishnarraj/clpeak/blob/master/results/NVIDIA_CUDA/GeForce_RTX_2080_Super.log
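
The two ratios mentioned above can be read straight off the clpeak output. This is a small sketch under the assumption that the log keeps the `name : value` layout shown in this comment. `parse_metrics` is a hypothetical helper, not part of clpeak, and the 3.46 us reference latency is the RTX 2080 SUPER figure quoted above.

```python
import re

# Excerpt of the clpeak output quoted above.
CLPEAK_EXCERPT = """
    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 21.64
      enqueueReadBuffer               : 8.92
      enqueueWriteBuffer non-blocking : 22.81
      enqueueReadBuffer non-blocking  : 9.10

    Kernel launch latency : 34.76 us
"""

# Matches "name : number" with an optional unit suffix such as "us".
METRIC_RE = re.compile(r"\s*(.+?)\s*:\s*([0-9.]+)(?:\s+\w+)?\s*$")


def parse_metrics(text):
    """Map each 'name : value' line to a float; section headers are skipped."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


metrics = parse_metrics(CLPEAK_EXCERPT)
write_over_read = metrics["enqueueWriteBuffer"] / metrics["enqueueReadBuffer"]
# 3.46 us is the RTX 2080 SUPER latency quoted in the comment above.
latency_ratio = metrics["Kernel launch latency"] / 3.46
print(round(write_over_read, 2), round(latency_ratio, 2))
```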

#600

@BA8F0D39
Author

@BartusW

I compiled and ran memory_benchmark_l0 and memory_benchmark_ocl
https://github.com/intel/compute-benchmarks

I attached the results as .txt files. The memory benchmarks produce nonsensical outputs such as inf, negative numbers, ERROR, and many others.


[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/49
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/51
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/53
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/55
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/57
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/59
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/61
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/63
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/65
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/67
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/69
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/71
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/73
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/75
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/77
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/79
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/81
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/83
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/85
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/87
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/89
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/91
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/93
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/95
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/72
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/76
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/80
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/84
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/88
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/92
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/96
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/100
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/104
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/108
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/112
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/116
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/120
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/124
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/128
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101
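
Because the log is long and repetitive, a per-suite tally makes the pattern easier to see: every Level Zero failure listed above is `zeEventQueryKernelTimestamp` returning `ZE_RESULT_NOT_READY`, which the Level Zero specification uses to report that the event has not been signaled. The sketch below assumes the log format shown above. `summarize_failures` is a hypothetical helper, not part of compute-benchmarks, and the excerpt contains only three of the entries.

```python
import re
from collections import Counter

# Three entries copied from the log above; a full run would read the
# attached .txt files instead.
LOG_EXCERPT = """\
[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/49
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
\tvalue: 1 (ZE_RESULT_NOT_READY)

[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/51
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
\tvalue: 1 (ZE_RESULT_NOT_READY)

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/72
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
\tvalue: 1 (ZE_RESULT_NOT_READY)
"""

FAILED_RE = re.compile(r"^\[\s*FAILED\s*\]\s+(\w+)/\S+$")
VALUE_RE = re.compile(r"^\s*value:\s*(-?\d+)\s+\((\w+)\)\s*$")


def summarize_failures(log):
    """Count failures per (test suite, returned error name)."""
    counts = Counter()
    suite = None
    for line in log.splitlines():
        failed = FAILED_RE.match(line)
        if failed:
            suite = failed.group(1)
            continue
        value = VALUE_RE.match(line)
        if value and suite is not None:
            counts[(suite, value.group(2))] += 1
            suite = None
    return counts


summary = summarize_failures(LOG_EXCERPT)
for (suite_name, error), count in sorted(summary.items()):
    print(suite_name, error, count)
```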

api_overhead_benchmark_l0.txt
api_overhead_benchmark_ocl.txt
multitile_memory_benchmark_l0.txt

multitile_memory_benchmark_ocl.txt
gpu_cmds_benchmark_l0.txt
show_devices_l0.txt
memory_benchmark_l0.txt
show_devices_ocl.txt
memory_benchmark_ocl.txt

BA8F0D39 changed the title from "GPU to GPU Memory Bandwidth Limited on A770 16 GB" to "Memory Transfer on A770 16 GB Fails Unit Tests and has Incomplete Level Zero/OpenCL API" on Feb 21, 2023
@BartusW
Contributor

BartusW commented Feb 21, 2023

Thanks for the follow-up and the additional experiments.
The OpenCL API tests with clPeak and the device-to-device memory bandwidth scenarios look healthy, but on kernel 6.2-rc8 we can see a performance discrepancy in kernel launch latency. This could be one of the reasons for the low end-to-end efficiency observed at the PyTorch layer.
We are queueing this GitHub issue as an internal task for reproduction and analysis.
We will keep the issue open until we confirm the problem and deliver a potential fix.

We are also working to confirm the Level Zero failures, which are not visible in the Intel CI environment.

@FreddieWitherden

Thanks for the follow-up and the additional experiments. The OpenCL API tests with clPeak and the device-to-device memory bandwidth scenarios look healthy.

For FP32, clPeak maxes out at ~13 TFLOP/s, but 512 (EUs) * 2400 (MHz) * 8 (SIMD length) * 2 (FMA) = ~20 TFLOP/s. For comparison, the same clPeak FP32 kernels on an iGPU reach the expected peak. Similarly, FP16 is exactly a factor of two off the theoretical peak.
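
The peak estimate above can be written out as a short calculation. This is a sketch using only the numbers from this thread: the `float` and `half` GFLOPS values are taken from the clpeak output earlier, and the FP16 peak is assumed to be twice the FP32 peak, which is an assumption made here for illustration. `peak_tflops` is a made-up helper name.

```python
def peak_tflops(eus, clock_mhz, simd_width, flops_per_lane=2):
    """Theoretical peak in TFLOP/s: EUs x clock x SIMD lanes x FLOPs per lane per cycle.

    flops_per_lane=2 counts a fused multiply-add as two floating-point operations.
    """
    return eus * clock_mhz * 1e6 * simd_width * flops_per_lane / 1e12


fp32_peak = peak_tflops(eus=512, clock_mhz=2400, simd_width=8)
fp16_peak = 2 * fp32_peak  # assumption: FP16 issues at twice the FP32 rate

# Measured clpeak results from the log above, converted from GFLOPS.
fp32_measured = 13017.51 / 1000
fp16_measured = 19543.72 / 1000

fp32_fraction = fp32_measured / fp32_peak
fp16_fraction = fp16_measured / fp16_peak
print(fp32_peak, round(fp32_fraction, 2), round(fp16_fraction, 2))
```

With these inputs FP32 lands at roughly two thirds of the computed peak and FP16 at roughly one half.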

@BA8F0D39
Author

BA8F0D39 commented Feb 21, 2023

@BartusW

memory_benchmark_ocl has a few errors and crashed my system. I think the Level Zero/OpenCL API returns invalid values too, which causes many applications like PyTorch/TensorFlow to behave weirdly.

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/1
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/3
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/5
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/7
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

@BartusW
Contributor

BartusW commented Feb 22, 2023

For FP32, clPeak maxes out at ~13 TFLOP/s, but 512 (EUs) * 2400 (MHz) * 8 (SIMD length) * 2 (FMA) = ~20 TFLOP/s. For comparison, the same clPeak FP32 kernels on an iGPU reach the expected peak. Similarly, FP16 is exactly a factor of two off the theoretical peak.

The clPeak single-precision FMA (a.k.a. MAD, multiply+add) synthetic test observations are correct. On the Arc family, the MAD instruction is split across two ticks, while on the integrated GFX mentioned above the two operations are fused into a single instruction that completes in one tick in this test.
The reduced MAD throughput shows up only in synthetic cases; there is no penalty in real scenarios.

@FreddieWitherden

The clPeak single-precision FMA (a.k.a. MAD, multiply+add) synthetic test observations are correct. On the Arc family, the MAD instruction is split across two ticks, while on the integrated GFX mentioned above the two operations are fused into a single instruction that completes in one tick in this test. The reduced MAD throughput shows up only in synthetic cases; there is no penalty in real scenarios.

As a point of reference, on a 96 EU iGPU my SGEMM kernels (derived from intel/intel-graphics-compiler#254) reach 1.5 TFLOP/s, or ~75% of peak. On my A770M (~16 TFLOP/s peak), with an appropriately scaled-up problem, the same kernels top out at ~6.3 TFLOP/s, or ~37.5% of peak. That is precisely a factor of two below what the iGPU achieves, and in line with the synthetics.
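
The comparison above boils down to a fraction-of-peak calculation. The sketch below uses only the rounded figures from this comment, so the results are approximate; the iGPU peak is inferred from "1.5 TFLOP/s or ~75% of peak" and is an assumption, as is the helper name `fraction_of_peak`.

```python
def fraction_of_peak(achieved_tflops, peak_tflops):
    """Share of the theoretical peak that a kernel actually reaches."""
    return achieved_tflops / peak_tflops


# iGPU peak inferred from the quoted 1.5 TFLOP/s at ~75% of peak (assumption).
igpu_peak = 1.5 / 0.75
igpu_fraction = fraction_of_peak(1.5, igpu_peak)

# A770M figures as quoted: ~6.3 TFLOP/s achieved against a ~16 TFLOP/s peak.
a770m_fraction = fraction_of_peak(6.3, 16.0)

# How much more of its peak the iGPU reaches compared with the A770M.
efficiency_gap = igpu_fraction / a770m_fraction
print(igpu_fraction, round(a770m_fraction, 3), round(efficiency_gap, 2))
```

With the rounded 16 TFLOP/s peak the A770M fraction comes out near 39% rather than 37.5%; the quoted 37.5% would correspond to a peak nearer 16.8 TFLOP/s. Either way the gap to the iGPU is close to a factor of two.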

@BartusW
Contributor

BartusW commented Feb 28, 2023

@BartusW

memory_benchmark_ocl has a few errors and crashed my system. I think the Level Zero/OpenCL API returns invalid values too, which causes many applications like PyTorch/TensorFlow to behave weirdly.

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/1
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/3
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/5
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/7
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

Hello,
The Intel team quickly checked the latest UMD compute driver release with a generic 6.2-rc6 kernel, and we can't confirm the failures mentioned. However, we focused on the latest release rather than revisiting the older driver package mentioned in the issue (22.43.24595.30 on rc8).

Please update the driver to the latest release, 22.49.25018.24: https://github.com/intel/compute-runtime/releases
The failures highlighted above manifested in a generic path, so a driver refresh could be a remedy here.

@BA8F0D39
Author

BA8F0D39 commented Mar 1, 2023

@BartusW

I upgraded to 22.49.25018.24 and kernel 6.2.1, but the errors still persist. I have an i5-13600K Raptor Lake CPU and 64 GB of RAM. Resizable BAR is enabled, and the Arc A770 is the only OpenCL device available on the system.

03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
	Subsystem: Intel Corporation Device 1020
	Flags: bus master, fast devsel, latency 0, IRQ 172
	Memory at 84000000 (64-bit, non-prefetchable) [size=16M]
	Memory at 4000000000 (64-bit, prefetchable) [size=16G]
	Expansion ROM at 85000000 [disabled] [size=2M]
	Capabilities: [40] Vendor Specific Information: Len=0c <?>
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+


sudo dpkg -i *.deb
(Reading database ... 22170 files and directories currently installed.)
Preparing to unpack intel-igc-core_1.0.12812.24_amd64.deb ...
Unpacking intel-igc-core (1.0.12812.24) over (1.0.12812.24) ...
Preparing to unpack intel-igc-opencl_1.0.12812.24_amd64.deb ...
Unpacking intel-igc-opencl (1.0.12812.24) over (1.0.12812.24) ...
Preparing to unpack intel-level-zero-gpu_1.3.25018.24_amd64.deb ...
Unpacking intel-level-zero-gpu (1.3.25018.24) over (1.3.25018.24) ...
Preparing to unpack intel-opencl-icd_22.49.25018.24_amd64.deb ...
Unpacking intel-opencl-icd (22.49.25018.24) over (22.49.25018.24) ...
dpkg: warning: downgrading libigdgmm12:amd64 from 22.3.3+i550~22.04 to 22.3.0
Preparing to unpack libigdgmm12_22.3.0_amd64.deb ...
Unpacking libigdgmm12:amd64 (22.3.0) over (22.3.3+i550~22.04) ...
Setting up intel-igc-core (1.0.12812.24) ...
Setting up intel-igc-opencl (1.0.12812.24) ...
Setting up libigdgmm12:amd64 (22.3.0) ...
Setting up intel-level-zero-gpu (1.3.25018.24) ...
Setting up intel-opencl-icd (22.49.25018.24) ...
Processing triggers for libc-bin (2.35-0ubuntu3.1) ...

clinfo

Number of platforms                               1
  Platform Name                                   Intel(R) OpenCL HD Graphics
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 3.0 
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_byte_addressable_store cl_khr_device_uuid cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_command_queue_families cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_intel_subgroups_char cl_intel_subgroups_long cl_khr_il_program cl_intel_mem_force_host_memory cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_clustered_reduce cl_intel_device_attribute_query cl_khr_suggested_local_work_size cl_intel_split_work_group_barrier cl_intel_spirv_media_block_io cl_intel_spirv_subgroups cl_khr_spirv_no_integer_wrap_decoration cl_intel_unified_shared_memory cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_intel_planar_yuv cl_intel_packed_yuv cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_3d_image_writes cl_intel_media_block_io cl_intel_bfloat16_conversions cl_intel_va_api_media_sharing cl_intel_sharing_format_query cl_khr_pci_bus_info cl_intel_create_buffer_with_properties cl_intel_dot_accumulate cl_intel_subgroup_local_block_io cl_intel_subgroup_matrix_multiply_accumulate cl_intel_subgroup_split_matrix_multiply_accumulate 
  Platform Extensions with Version                cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_device_uuid                                               0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_intel_command_queue_families                                  0x400000 (1.0.0)
                                                  cl_intel_subgroups                                               0x400000 (1.0.0)
                                                  cl_intel_required_subgroup_size                                  0x400000 (1.0.0)
                                                  cl_intel_subgroups_short                                         0x400000 (1.0.0)
                                                  cl_khr_spir                                                      0x400000 (1.0.0)
                                                  cl_intel_accelerator                                             0x400000 (1.0.0)
                                                  cl_intel_driver_diagnostics                                      0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                            0x400000 (1.0.0)
                                                  cl_khr_throttle_hints                                            0x400000 (1.0.0)
                                                  cl_khr_create_command_queue                                      0x400000 (1.0.0)
                                                  cl_intel_subgroups_char                                          0x400000 (1.0.0)
                                                  cl_intel_subgroups_long                                          0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_intel_mem_force_host_memory                                   0x400000 (1.0.0)
                                                  cl_khr_subgroup_extended_types                                   0x400000 (1.0.0)
                                                  cl_khr_subgroup_non_uniform_vote                                 0x400000 (1.0.0)
                                                  cl_khr_subgroup_ballot                                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_non_uniform_arithmetic                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle                                          0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle_relative                                 0x400000 (1.0.0)
                                                  cl_khr_subgroup_clustered_reduce                                 0x400000 (1.0.0)
                                                  cl_intel_device_attribute_query                                  0x400000 (1.0.0)
                                                  cl_khr_suggested_local_work_size                                 0x400000 (1.0.0)
                                                  cl_intel_split_work_group_barrier                                0x400000 (1.0.0)
                                                  cl_intel_spirv_media_block_io                                    0x400000 (1.0.0)
                                                  cl_intel_spirv_subgroups                                         0x400000 (1.0.0)
                                                  cl_khr_spirv_no_integer_wrap_decoration                          0x400000 (1.0.0)
                                                  cl_intel_unified_shared_memory                                   0x400000 (1.0.0)
                                                  cl_khr_mipmap_image                                              0x400000 (1.0.0)
                                                  cl_khr_mipmap_image_writes                                       0x400000 (1.0.0)
                                                  cl_intel_planar_yuv                                              0x400000 (1.0.0)
                                                  cl_intel_packed_yuv                                              0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                                  cl_khr_image2d_from_buffer                                       0x400000 (1.0.0)
                                                  cl_khr_depth_images                                              0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                                  cl_intel_media_block_io                                          0x400000 (1.0.0)
                                                  cl_intel_bfloat16_conversions                                    0x400000 (1.0.0)
                                                  cl_intel_va_api_media_sharing                                    0x400000 (1.0.0)
                                                  cl_intel_sharing_format_query                                    0x400000 (1.0.0)
                                                  cl_khr_pci_bus_info                                              0x400000 (1.0.0)
                                                  cl_intel_create_buffer_with_properties                           0x400000 (1.0.0)
                                                  cl_intel_dot_accumulate                                          0x400000 (1.0.0)
                                                  cl_intel_subgroup_local_block_io                                 0x400000 (1.0.0)
                                                  cl_intel_subgroup_matrix_multiply_accumulate                     0x400000 (1.0.0)
                                                  cl_intel_subgroup_split_matrix_multiply_accumulate               0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             INTEL
  Platform Host timer resolution                  1ns

  Platform Name                                   Intel(R) OpenCL HD Graphics
Number of devices                                 1
  Device Name                                     Intel(R) Graphics [0x56a0]
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 3.0 NEO 
  Device UUID                                     86800000-a056-0000-0000-000000000000
  Driver UUID                                     32322e34-392e-3235-3031-382e32340000
  Valid Device LUID                               No
  Device LUID                                     f015-652cfd7f0000
  Device Node Mask                                0
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  22.49.25018.24
  Device OpenCL C Version                         OpenCL C 1.2 
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_int64                                                 0xc00000 (3.0.0)
                                                  __opencl_c_3d_image_writes                                       0xc00000 (3.0.0)
                                                  __opencl_c_images                                                0xc00000 (3.0.0)
                                                  __opencl_c_read_write_images                                     0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_acq_rel                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_seq_cst                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_all_devices                              0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_device                                   0xc00000 (3.0.0)
                                                  __opencl_c_generic_address_space                                 0xc00000 (3.0.0)
                                                  __opencl_c_program_scope_global_variables                        0xc00000 (3.0.0)
                                                  __opencl_c_work_group_collective_functions                       0xc00000 (3.0.0)
                                                  __opencl_c_subgroups                                             0xc00000 (3.0.0)
  Latest comfornace test passed                   v2022-04-22-00
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               512
  Max clock frequency                             2400MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple (device)     64
  Preferred work group size multiple (kernel)     64
  Max sub-groups per work group                   128
  Sub-group sizes (Intel)                         8, 16, 32
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 1 / 1       
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              16225243136 (15.11GiB)
  Error Correction support                        No
  Max memory allocation                           4294959104 (4GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         64 bytes
  Atomic memory capabilities                      relaxed, acquire/release, sequentially-consistent, work-group scope, device scope, all-devices scope
  Atomic fence capabilities                       relaxed, acquire/release, sequentially-consistent, work-item scope, work-group scope, device scope, all-devices scope
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             4294959104 (4GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16777216 (16MiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            268434944 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   4 bytes
    Pitch alignment for 2D image buffers          4 pixels
    Max 2D image size                             16384x16384 pixels
    Max planar YUV image size                     16384x16128 pixels
    Max 3D image size                             16384x16384x2048 pixels
    Max number of read image args                 128
    Max number of write image args                128
    Max number of read/write image args           128
  Pipe support                                    No
  Max number of pipe args                         0
  Max active pipe reservations                    0
  Max pipe packet size                            0
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     8
  Max constant buffer size                        4294959104 (4GiB)
  Generic address space support                   Yes
  Max size of kernel argument                     2048 (2KiB)
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)                    
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      52ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Non-uniform work-groups                       Yes
    Work-group collective functions               Yes
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.2 
    ILs with version                              SPIR-V                                                           0x402000 (1.2.0)
    SPIR versions                                 1.2 
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_device_uuid cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_command_queue_families cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_intel_subgroups_char cl_intel_subgroups_long cl_khr_il_program cl_intel_mem_force_host_memory cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_clustered_reduce cl_intel_device_attribute_query cl_khr_suggested_local_work_size cl_intel_split_work_group_barrier cl_intel_spirv_media_block_io cl_intel_spirv_subgroups cl_khr_spirv_no_integer_wrap_decoration cl_intel_unified_shared_memory cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_intel_planar_yuv cl_intel_packed_yuv cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_3d_image_writes cl_intel_media_block_io cl_intel_bfloat16_conversions cl_intel_va_api_media_sharing cl_intel_sharing_format_query cl_khr_pci_bus_info cl_intel_create_buffer_with_properties cl_intel_dot_accumulate cl_intel_subgroup_local_block_io cl_intel_subgroup_matrix_multiply_accumulate cl_intel_subgroup_split_matrix_multiply_accumulate 
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_device_uuid                                               0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_intel_command_queue_families                                  0x400000 (1.0.0)
                                                  cl_intel_subgroups                                               0x400000 (1.0.0)
                                                  cl_intel_required_subgroup_size                                  0x400000 (1.0.0)
                                                  cl_intel_subgroups_short                                         0x400000 (1.0.0)
                                                  cl_khr_spir                                                      0x400000 (1.0.0)
                                                  cl_intel_accelerator                                             0x400000 (1.0.0)
                                                  cl_intel_driver_diagnostics                                      0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                            0x400000 (1.0.0)
                                                  cl_khr_throttle_hints                                            0x400000 (1.0.0)
                                                  cl_khr_create_command_queue                                      0x400000 (1.0.0)
                                                  cl_intel_subgroups_char                                          0x400000 (1.0.0)
                                                  cl_intel_subgroups_long                                          0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_intel_mem_force_host_memory                                   0x400000 (1.0.0)
                                                  cl_khr_subgroup_extended_types                                   0x400000 (1.0.0)
                                                  cl_khr_subgroup_non_uniform_vote                                 0x400000 (1.0.0)
                                                  cl_khr_subgroup_ballot                                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_non_uniform_arithmetic                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle                                          0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle_relative                                 0x400000 (1.0.0)
                                                  cl_khr_subgroup_clustered_reduce                                 0x400000 (1.0.0)
                                                  cl_intel_device_attribute_query                                  0x400000 (1.0.0)
                                                  cl_khr_suggested_local_work_size                                 0x400000 (1.0.0)
                                                  cl_intel_split_work_group_barrier                                0x400000 (1.0.0)
                                                  cl_intel_spirv_media_block_io                                    0x400000 (1.0.0)
                                                  cl_intel_spirv_subgroups                                         0x400000 (1.0.0)
                                                  cl_khr_spirv_no_integer_wrap_decoration                          0x400000 (1.0.0)
                                                  cl_intel_unified_shared_memory                                   0x400000 (1.0.0)
                                                  cl_khr_mipmap_image                                              0x400000 (1.0.0)
                                                  cl_khr_mipmap_image_writes                                       0x400000 (1.0.0)
                                                  cl_intel_planar_yuv                                              0x400000 (1.0.0)
                                                  cl_intel_packed_yuv                                              0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                                  cl_khr_image2d_from_buffer                                       0x400000 (1.0.0)
                                                  cl_khr_depth_images                                              0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                                  cl_intel_media_block_io                                          0x400000 (1.0.0)
                                                  cl_intel_bfloat16_conversions                                    0x400000 (1.0.0)
                                                  cl_intel_va_api_media_sharing                                    0x400000 (1.0.0)
                                                  cl_intel_sharing_format_query                                    0x400000 (1.0.0)
                                                  cl_khr_pci_bus_info                                              0x400000 (1.0.0)
                                                  cl_intel_create_buffer_with_properties                           0x400000 (1.0.0)
                                                  cl_intel_dot_accumulate                                          0x400000 (1.0.0)
                                                  cl_intel_subgroup_local_block_io                                 0x400000 (1.0.0)
                                                  cl_intel_subgroup_matrix_multiply_accumulate                     0x400000 (1.0.0)
                                                  cl_intel_subgroup_split_matrix_multiply_accumulate               0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel(R) OpenCL HD Graphics
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [INTEL]
  clCreateContext(NULL, ...) [default]            Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Intel(R) OpenCL HD Graphics
    Device Name                                   Intel(R) Graphics [0x56a0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Intel(R) OpenCL HD Graphics
    Device Name                                   Intel(R) Graphics [0x56a0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel(R) OpenCL HD Graphics
    Device Name                                   Intel(R) Graphics [0x56a0]

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.14
  ICD loader Profile                              OpenCL 3.0

memory_benchmark_ocl

OpenCL Platform: Intel(R) OpenCL HD Graphics
	Device: Intel(R) Graphics [0x56a0]
		driverVersion: 22.49.25018.24
		computeUnits:  512
		clockFreq:     2400
		intelProduct:  Dg2 (intelGen: XeHpgCore)

Running 10 iterations of each benchmark

                                                                                                                      TestCase           Mean         Median         StdDev            Min            Max   Type   Label [unit]
                              CopyBuffer(api=ocl size=1 contents=Zeros compressedSource=0 compressedDestination=0 useEvents=1)          0.000          0.000          6.18%          0.000          0.000  [GPU]         [GB/s]
                              CopyBuffer(api=ocl size=1 contents=Zeros compressedSource=0 compressedDestination=1 useEvents=1)          0.000          0.000          7.43%          0.000          0.000  [GPU]         [GB/s]
                              CopyBuffer(api=ocl size=1 contents=Zeros compressedSource=1 compressedDestination=0 useEvents=1)          0.000          0.000         11.52%          0.000          0.000  [GPU]         [GB/s]
                              CopyBuffer(api=ocl size=1 contents=Zeros compressedSource=1 compressedDestination=1 useEvents=1)          0.000          0.000          8.45%          0.000          0.000  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=128MB contents=Zeros compressedSource=0 compressedDestination=0 useEvents=1)        229.939        228.052          1.39%        227.408        234.954  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=128MB contents=Zeros compressedSource=0 compressedDestination=1 useEvents=1)        373.564        371.806          2.70%        352.336        390.452  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=128MB contents=Zeros compressedSource=1 compressedDestination=0 useEvents=1)        435.090        433.034          1.94%        430.214        460.176  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=128MB contents=Zeros compressedSource=1 compressedDestination=1 useEvents=1)        694.717        681.747          4.21%        676.375        754.388  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=512MB contents=Zeros compressedSource=0 compressedDestination=0 useEvents=1)        228.082        227.774          0.35%        227.297        229.739  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=512MB contents=Zeros compressedSource=0 compressedDestination=1 useEvents=1)        393.307        393.237          0.12%        392.653        394.396  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=512MB contents=Zeros compressedSource=1 compressedDestination=0 useEvents=1)        443.330        443.085          0.17%        442.324        444.691  [GPU]         [GB/s]
                          CopyBuffer(api=ocl size=512MB contents=Zeros compressedSource=1 compressedDestination=1 useEvents=1)        738.225        738.708          0.24%        734.810        741.258  [GPU]         [GB/s]
        CopyBufferRect(api=ocl size=16KB srcCompressed=0 dstCompressed=0 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.471          0.455          8.04%          0.448          0.568  [CPU]         [GB/s]
       CopyBufferRect(api=ocl size=2MB srcCompressed=0 dstCompressed=0 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)         30.447         31.499          4.64%         27.928         31.671  [CPU]         [GB/s]
  CopyBufferRect(api=ocl size=128MB srcCompressed=0 dstCompressed=0 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)         67.576         67.835          0.84%         66.410         67.921  [CPU]         [GB/s]
     CopyBufferRect(api=ocl size=16MB srcCompressed=0 dstCompressed=0 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)         54.291         54.962          3.21%         49.130         55.086  [CPU]         [GB/s]
      CopyBufferRect(api=ocl size=16MB srcCompressed=0 dstCompressed=0 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         54.697         54.780          0.52%         53.906         54.970  [CPU]         [GB/s]
        CopyBufferRect(api=ocl size=16KB srcCompressed=0 dstCompressed=1 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.471          0.447          8.15%          0.440          0.534  [CPU]         [GB/s]
       CopyBufferRect(api=ocl size=2MB srcCompressed=0 dstCompressed=1 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)         30.583         30.076          3.49%         29.816         32.942  [CPU]         [GB/s]
  CopyBufferRect(api=ocl size=128MB srcCompressed=0 dstCompressed=1 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)         68.324         68.485          0.69%         66.936         68.584  [CPU]         [GB/s]
     CopyBufferRect(api=ocl size=16MB srcCompressed=0 dstCompressed=1 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)         56.026         56.575          3.24%         50.589         56.894  [CPU]         [GB/s]
      CopyBufferRect(api=ocl size=16MB srcCompressed=0 dstCompressed=1 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         57.653         57.749          0.68%         56.982         58.269  [CPU]         [GB/s]
        CopyBufferRect(api=ocl size=16KB srcCompressed=1 dstCompressed=0 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.474          0.453          8.34%          0.440          0.537  [CPU]         [GB/s]
       CopyBufferRect(api=ocl size=2MB srcCompressed=1 dstCompressed=0 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)         29.101         29.129          0.82%         28.692         29.419  [CPU]         [GB/s]
  CopyBufferRect(api=ocl size=128MB srcCompressed=1 dstCompressed=0 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)         67.669         67.912          0.75%         66.490         67.997  [CPU]         [GB/s]
     CopyBufferRect(api=ocl size=16MB srcCompressed=1 dstCompressed=0 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)         54.306         54.841          3.11%         49.251         55.056  [CPU]         [GB/s]
      CopyBufferRect(api=ocl size=16MB srcCompressed=1 dstCompressed=0 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         54.913         54.906          0.19%         54.732         55.053  [CPU]         [GB/s]
        CopyBufferRect(api=ocl size=16KB srcCompressed=1 dstCompressed=1 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.469          0.451          7.46%          0.436          0.531  [CPU]         [GB/s]
       CopyBufferRect(api=ocl size=2MB srcCompressed=1 dstCompressed=1 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)         29.714         30.220          4.99%         25.307         30.478  [CPU]         [GB/s]
  CopyBufferRect(api=ocl size=128MB srcCompressed=1 dstCompressed=1 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)         68.352         68.616          0.83%         67.174         68.715  [CPU]         [GB/s]
     CopyBufferRect(api=ocl size=16MB srcCompressed=1 dstCompressed=1 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)         55.961         56.466          3.26%         50.528         56.884  [CPU]         [GB/s]
      CopyBufferRect(api=ocl size=16MB srcCompressed=1 dstCompressed=1 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         56.579         56.584          0.26%         56.277         56.842  [CPU]         [GB/s]
                                                             CopyEntireImage(api=ocl size=8192:1:1 forceBlitter=0 useEvents=1)         22.583         22.879          3.73%         20.972         23.745  [GPU]         [GB/s]
                                                            CopyEntireImage(api=ocl size=16384:1:1 forceBlitter=0 useEvents=1)         37.324         38.130          6.29%         32.686         41.943  [GPU]         [GB/s]
                                                            CopyEntireImage(api=ocl size=256:512:1 forceBlitter=0 useEvents=1)        134.067        134.218          1.37%        131.590        137.898  [GPU]         [GB/s]
                                                            CopyEntireImage(api=ocl size=512:512:1 forceBlitter=0 useEvents=1)        188.215        187.769          2.78%        181.375        197.379  [GPU]         [GB/s]
                                                            CopyEntireImage(api=ocl size=512:512:2 forceBlitter=0 useEvents=1)         65.039         65.023          1.30%         63.863         66.665  [GPU]         [GB/s]
                                                           CopyEntireImage(api=ocl size=512:512:64 forceBlitter=0 useEvents=1)         66.013         66.025          0.53%         65.539         66.749  [GPU]         [GB/s]
                                                            CopyImageRegion(api=ocl size=128:128:1 forceBlitter=0 useEvents=1)         34.620         34.714          3.83%         32.686         37.562  [GPU]         [GB/s]
                                                          CopyImageRegion(api=ocl size=128:128:128 forceBlitter=0 useEvents=1)        202.367        202.785          1.18%        199.334        206.622  [GPU]         [GB/s]
                                                        CopyImageRegion(api=ocl size=128:1024:1024 forceBlitter=0 useEvents=1)         85.799         85.767          0.07%         85.718         85.914  [GPU]         [GB/s]
                                                         CopyImageRegion(api=ocl size=1024:16:1024 forceBlitter=0 useEvents=1)        191.744        191.854          0.27%        190.958        192.528  [GPU]         [GB/s]
                                                         CopyImageRegion(api=ocl size=1024:1024:16 forceBlitter=0 useEvents=1)         57.461         57.372          1.02%         56.636         58.666  [GPU]         [GB/s]
                           FillBuffer(api=ocl size=128MB contents=Zeros patternSize=1 compressed=0 forceBlitter=0 useEvents=1)        481.777        482.490          0.39%        478.281        484.032  [GPU]         [GB/s]
                           FillBuffer(api=ocl size=128MB contents=Zeros patternSize=1 compressed=1 forceBlitter=0 useEvents=1)        828.253        828.612          0.39%        823.846        834.516  [GPU]         [GB/s]
                          FillBuffer(api=ocl size=128MB contents=Zeros patternSize=16 compressed=0 forceBlitter=0 useEvents=1)        482.532        482.672          0.30%        479.885        484.577  [GPU]         [GB/s]
                          FillBuffer(api=ocl size=128MB contents=Zeros patternSize=16 compressed=1 forceBlitter=0 useEvents=1)        830.258        832.631          0.72%        818.092        836.685  [GPU]         [GB/s]
                         FillBuffer(api=ocl size=128MB contents=Zeros patternSize=128 compressed=0 forceBlitter=0 useEvents=1)        478.185        483.124          3.36%        430.071        486.224  [GPU]         [GB/s]
                         FillBuffer(api=ocl size=128MB contents=Zeros patternSize=128 compressed=1 forceBlitter=0 useEvents=1)        818.168        833.708          5.79%        677.084        841.601  [GPU]         [GB/s]
                           FillBuffer(api=ocl size=512MB contents=Zeros patternSize=1 compressed=0 forceBlitter=0 useEvents=1)        471.876        471.867          0.16%        470.939        473.579  [GPU]         [GB/s]
                           FillBuffer(api=ocl size=512MB contents=Zeros patternSize=1 compressed=1 forceBlitter=0 useEvents=1)        855.306        855.075          0.46%        847.134        860.571  [GPU]         [GB/s]
                          FillBuffer(api=ocl size=512MB contents=Zeros patternSize=16 compressed=0 forceBlitter=0 useEvents=1)        472.469        472.321          0.13%        471.672        473.536  [GPU]         [GB/s]
                          FillBuffer(api=ocl size=512MB contents=Zeros patternSize=16 compressed=1 forceBlitter=0 useEvents=1)        855.549        854.020          0.63%        849.088        863.889  [GPU]         [GB/s]
                         FillBuffer(api=ocl size=512MB contents=Zeros patternSize=128 compressed=0 forceBlitter=0 useEvents=1)        472.024        471.867          0.22%        470.596        473.710  [GPU]         [GB/s]
                         FillBuffer(api=ocl size=512MB contents=Zeros patternSize=128 compressed=1 forceBlitter=0 useEvents=1)        854.989        857.708          0.69%        844.082        861.867  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)         22.164         22.164          0.00%         22.162         22.165  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.151         22.163          0.11%         22.101         22.165  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf    2581110.154            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.150         22.162          0.12%         22.097         22.165  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.143         22.162          0.13%         22.093         22.163  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf            inf            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)         22.400         22.403          0.03%         22.388         22.404  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.402         22.403          0.02%         22.388         22.404  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.401         22.402          0.02%         22.387         22.403  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.401         22.403          0.02%         22.387         22.403  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=None)          8.908          8.911          0.08%          8.893          8.915  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         21.459         21.351          0.84%         21.281         21.703  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         21.427         21.422          1.17%         21.101         21.722  [CPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=None)          9.100          9.100          0.01%          9.098          9.103  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=Usm)         22.115         22.102          0.11%         22.101         22.166  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=Map)         22.151         22.164          0.12%         22.096         22.167  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=None)          8.893          8.893          0.07%          8.880          8.901  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=Usm)         21.396         21.315          1.22%         21.031         21.719  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=Map)         21.197         21.209          0.35%         21.062         21.349  [CPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=None)          9.097          9.097          0.01%          9.096          9.099  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=Usm)         22.125         22.102          0.14%         22.099         22.164  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=Map)         22.137         22.158          0.14%         22.098         22.164  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=None)          8.926          8.927          0.04%          8.917          8.932  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         22.281         22.282          0.02%         22.271         22.287  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         22.256         22.281          0.23%         22.142         22.288  [CPU]         [GB/s]
                                             ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=None)          9.107          9.108          0.05%          9.092          9.109  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=Usm)         22.402         22.404          0.04%         22.387         22.420  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=Map)         22.400         22.403          0.04%         22.372         22.404  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=None)          8.942          8.942          0.06%          8.932          8.949  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=Usm)         22.209         22.177          0.26%         22.147         22.282  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=Map)         22.261         22.280          0.23%         22.107         22.287  [CPU]         [GB/s]
                                             ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=None)          9.113          9.114          0.03%          9.107          9.116  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=Usm)         22.401         22.403          0.02%         22.386         22.404  [GPU]         [GB/s]
                                              ReadBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=Map)         22.402         22.402          0.00%         22.402         22.403  [GPU]         [GB/s]
                                                            ReadBufferMisaligned(api=ocl size=16MB misalignment=0 useEvents=1)         22.002         22.429          2.97%         21.004         22.432  [GPU]         [GB/s]
                                                            ReadBufferMisaligned(api=ocl size=16MB misalignment=1 useEvents=1)          9.038          9.040          1.09%          8.934          9.139  [GPU]         [GB/s]
                                                            ReadBufferMisaligned(api=ocl size=16MB misalignment=2 useEvents=1)          9.001          9.035          1.69%          8.693          9.140  [GPU]         [GB/s]
                                                            ReadBufferMisaligned(api=ocl size=16MB misalignment=4 useEvents=1)          8.958          9.037          2.23%          8.672          9.142  [GPU]         [GB/s]
                           ReadBufferRect(api=ocl size=16KB compressed=0 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.206          0.207         21.09%          0.150          0.300  [CPU]         [GB/s]
                          ReadBufferRect(api=ocl size=2MB compressed=0 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)          6.424          6.929         14.59%          4.345          7.025  [CPU]         [GB/s]
                     ReadBufferRect(api=ocl size=128MB compressed=0 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)          0.331          0.331          0.02%          0.331          0.331  [CPU]         [GB/s]
                        ReadBufferRect(api=ocl size=16MB compressed=0 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)          2.444          2.443          0.43%          2.427          2.472  [CPU]         [GB/s]
                         ReadBufferRect(api=ocl size=16MB compressed=0 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)          2.470          2.477          0.57%          2.442          2.481  [CPU]         [GB/s]
                           ReadBufferRect(api=ocl size=16KB compressed=1 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.152          0.153          3.91%          0.137          0.160  [CPU]         [GB/s]
                          ReadBufferRect(api=ocl size=2MB compressed=1 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)          0.324          0.324          0.16%          0.323          0.325  [CPU]         [GB/s]
                     ReadBufferRect(api=ocl size=128MB compressed=1 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)          0.331          0.331          0.04%          0.330          0.331  [CPU]         [GB/s]
                        ReadBufferRect(api=ocl size=16MB compressed=1 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)          2.440          2.440          0.23%          2.430          2.448  [CPU]         [GB/s]
                         ReadBufferRect(api=ocl size=16MB compressed=1 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)          2.474          2.478          0.53%          2.436          2.484  [CPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=64KB compressed=0)      13437.183      13439.742          0.34%      13368.300      13507.355  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=64KB compressed=1)      13672.256      13450.944          2.61%      13421.773      14273.428  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=128KB compressed=0)      13776.222      13493.792          2.87%      13417.308      14278.482  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=128KB compressed=1)      13684.431      13464.428          2.81%      13350.598      14278.482  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=512KB compressed=0)       5463.673       5443.469          1.02%       5412.005       5572.286  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=512KB compressed=1)       5293.085       5294.241          0.23%       5271.021       5317.660  [GPU]         [GB/s]
                                                                            ReadDeviceMemBuffer(api=ocl size=1MB compressed=0)       4663.164       4699.503          2.40%       4327.281       4708.294  [GPU]         [GB/s]
                                                                            ReadDeviceMemBuffer(api=ocl size=1MB compressed=1)       4639.069       4663.044          0.95%       4568.859       4678.209  [GPU]         [GB/s]
                                                                            ReadDeviceMemBuffer(api=ocl size=4MB compressed=0)       4093.028       4092.857          0.82%       4055.736       4128.508  [GPU]         [GB/s]
                                                                            ReadDeviceMemBuffer(api=ocl size=4MB compressed=1)       3955.995       3961.953          1.51%       3794.679       4026.936  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=16MB compressed=0)       3693.072       3686.458          0.65%       3668.155       3744.567  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=16MB compressed=1)       3652.765       3642.441          0.57%       3637.009       3698.139  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=32MB compressed=0)        488.932        489.777          0.33%        484.891        490.496  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=32MB compressed=1)       3434.536       3418.256          0.74%       3405.391       3466.367  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=64MB compressed=0)        445.025        445.164          0.10%        443.969        445.497  [GPU]         [GB/s]
                                                                           ReadDeviceMemBuffer(api=ocl size=64MB compressed=1)        558.613        558.221          0.48%        554.726        563.948  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=128MB compressed=0)        412.281        412.356          0.19%        410.598        413.567  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=128MB compressed=1)        479.875        480.608          0.61%        474.558        484.173  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=256MB compressed=0)        396.360        396.627          0.19%        395.103        397.168  [GPU]         [GB/s]
                                                                          ReadDeviceMemBuffer(api=ocl size=256MB compressed=1)        438.145        437.838          0.84%        430.010        443.519  [GPU]         [GB/s]
                                                            SLM_DataAccessLatency(api=ocl size=8KB occupancyDiv=8 direction=0)         12.000         12.000          0.00%         12.000         12.000  [GPU]          [clk]
                                                            SLM_DataAccessLatency(api=ocl size=8KB occupancyDiv=4 direction=0)         12.000         12.000          0.00%         12.000         12.000  [GPU]          [clk]
                                                            SLM_DataAccessLatency(api=ocl size=8KB occupancyDiv=2 direction=0)         12.000         12.000          0.00%         12.000         12.000  [GPU]          [clk]
                                                            SLM_DataAccessLatency(api=ocl size=8KB occupancyDiv=1 direction=0)         12.000         12.000          0.00%         12.000         12.000  [GPU]          [clk]
                                                                   StreamAfterTransfer(api=ocl type=Read size=1MB useEvents=0)         16.399         15.983         19.97%         12.627         21.191  [CPU]         [GB/s]
                                                                   StreamAfterTransfer(api=ocl type=Read size=1MB useEvents=1)        114.526        124.283         33.98%         40.106        162.369  [GPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Read size=16MB useEvents=0)         95.517         96.039         26.66%         62.588        156.612  [CPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Read size=16MB useEvents=1)        433.429        431.805          2.84%        415.113        449.900  [GPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Read size=64MB useEvents=0)        238.962        240.395          5.43%        203.580        253.042  [CPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Read size=64MB useEvents=1)        457.094        457.075          1.06%        445.536        462.820  [GPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Read size=256MB useEvents=0)        344.262        337.998          3.69%        331.671        372.708  [CPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Read size=256MB useEvents=1)        471.753        472.624          0.50%        467.352        474.582  [GPU]         [GB/s]
                                                                   StreamAfterTransfer(api=ocl type=Read size=1GB useEvents=0)        422.639        422.663          0.22%        420.779        424.488  [CPU]         [GB/s]
                                                                   StreamAfterTransfer(api=ocl type=Read size=1GB useEvents=1)        466.997        467.130          0.31%        463.319        468.776  [GPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Triad size=1MB useEvents=0)         57.075         57.709          4.33%         50.695         60.240  [CPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Triad size=1MB useEvents=1)        308.914        308.163          3.77%        290.384        331.863  [GPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Triad size=16MB useEvents=0)        260.031        311.093         29.27%        137.463        315.274  [CPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Triad size=16MB useEvents=1)        484.021        484.154          0.51%        479.349        487.573  [GPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Triad size=64MB useEvents=0)        319.460        315.321          2.49%        312.822        333.915  [CPU]         [GB/s]
                                                                 StreamAfterTransfer(api=ocl type=Triad size=64MB useEvents=1)        457.751        457.886          0.35%        455.512        459.847  [GPU]         [GB/s]
                                                                StreamAfterTransfer(api=ocl type=Triad size=256MB useEvents=0)        382.815        387.269          3.21%        358.183        397.706  [CPU]         [GB/s]
                                                                StreamAfterTransfer(api=ocl type=Triad size=256MB useEvents=1)        435.971        432.948          0.92%        432.234        441.466  [GPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Triad size=1GB useEvents=0)        419.732        419.344          0.43%        418.150        424.971  [CPU]         [GB/s]
                                                                  StreamAfterTransfer(api=ocl type=Triad size=1GB useEvents=1)        441.463        441.428          0.05%        441.157        441.856  [GPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=1MB useEvents=0)         23.283         27.391         39.79%          1.981         31.480  [CPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=1MB useEvents=1)        160.757        161.082          5.40%        148.041        179.766  [GPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=8MB useEvents=0)         36.084         26.171         81.93%         19.913        123.876  [CPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=8MB useEvents=1)        270.419        221.244         29.38%        211.369        396.718  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Read size=32MB useEvents=0)         89.629         89.361          5.73%         82.069         98.469  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Read size=32MB useEvents=1)        381.713        372.837         14.95%        281.332        469.569  [GPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Read size=128MB useEvents=0)        225.641        227.071          2.80%        213.023        234.481  [CPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Read size=128MB useEvents=1)        448.533        443.717          3.05%        430.646        474.408  [GPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Read size=512MB useEvents=0)        370.034        370.214          0.58%        365.441        373.196  [CPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Read size=512MB useEvents=1)        469.202        468.180          0.79%        464.363        474.451  [GPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=1GB useEvents=0)        415.207        416.216          1.42%        402.700        425.647  [CPU]         [GB/s]
                                                                          StreamMemory(api=ocl type=Read size=1GB useEvents=1)        443.681        466.180         15.56%        237.116        474.932  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=1MB useEvents=0)          7.294          3.617         99.53%          2.975         27.214  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=1MB useEvents=1)        236.177        245.568          6.14%        209.715        245.568  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=8MB useEvents=0)         25.033         25.177          7.34%         21.912         28.626  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=8MB useEvents=1)        549.486        631.634         26.44%        267.545        644.286  [GPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Write size=32MB useEvents=0)         86.074         86.428          9.56%         72.420        102.089  [CPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Write size=32MB useEvents=1)        452.018        428.926         10.58%        421.078        548.768  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Write size=128MB useEvents=0)        221.394        221.495          1.75%        215.647        226.262  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Write size=128MB useEvents=1)        466.667        458.538          2.82%        456.265        488.991  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Write size=512MB useEvents=0)        365.680        366.239          0.67%        361.459        369.253  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Write size=512MB useEvents=1)        470.849        473.405          0.85%        464.698        475.020  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=1GB useEvents=0)        400.094        406.393          3.65%        368.773        415.120  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Write size=1GB useEvents=1)        465.320        467.904          1.41%        450.876        472.191  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=1MB useEvents=0)         10.670          6.409        123.41%          5.643         50.159  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=1MB useEvents=1)        292.954        296.082          6.02%        248.566        314.604  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=8MB useEvents=0)         50.332         48.029         15.08%         42.353         71.492  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=8MB useEvents=1)        544.977        422.738         29.47%        401.657        745.654  [GPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Scale size=32MB useEvents=0)        150.119        146.486          6.62%        137.627        169.391  [CPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Scale size=32MB useEvents=1)        495.402        471.115          6.84%        461.826        538.219  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Scale size=128MB useEvents=0)        303.772        303.663          1.53%        297.067        309.635  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Scale size=128MB useEvents=1)        466.047        463.361          1.18%        460.505        477.573  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Scale size=512MB useEvents=0)        406.940        408.442          1.32%        397.545        412.645  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Scale size=512MB useEvents=1)        460.613        461.766          0.99%        450.758        465.558  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=1GB useEvents=0)        416.081        414.837          1.15%        412.166        428.945  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Scale size=1GB useEvents=1)        462.553        462.700          0.13%        461.277        463.527  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=1MB useEvents=0)         16.782         10.042        120.59%          8.255         77.388  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=1MB useEvents=1)        316.038        316.242          3.79%        290.384        331.863  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=8MB useEvents=0)         90.666         71.172         70.27%         52.178        280.693  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=8MB useEvents=1)        453.174        408.099         15.75%        399.991        569.801  [GPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Triad size=32MB useEvents=0)        190.023        191.097          2.67%        179.507        198.604  [CPU]         [GB/s]
                                                                        StreamMemory(api=ocl type=Triad size=32MB useEvents=1)        440.310        430.744          3.66%        427.218        465.721  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Triad size=128MB useEvents=0)        331.344        330.421          1.37%        324.909        339.221  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Triad size=128MB useEvents=1)        450.059        451.759          0.83%        442.780        453.109  [GPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Triad size=512MB useEvents=0)        406.178        407.412          1.09%        399.769        412.890  [CPU]         [GB/s]
                                                                       StreamMemory(api=ocl type=Triad size=512MB useEvents=1)        441.730        442.951          0.50%        436.148        443.631  [GPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=1GB useEvents=0)        422.141        421.923          0.61%        418.274        427.047  [CPU]         [GB/s]
                                                                         StreamMemory(api=ocl type=Triad size=1GB useEvents=1)        445.503        445.616          0.11%        444.358        446.018  [GPU]         [GB/s]
                                         UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)      49044.661      49827.170         16.75%      34574.376      66117.107  [CPU]         [GB/s]
                                        UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         23.143         23.298          1.15%         22.730         23.347  [CPU]         [GB/s]
                              UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)         22.248         22.113          1.08%         22.050         22.649  [CPU]         [GB/s]
                                         UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)      53768.326      56094.264         22.32%      30311.140      69113.145  [CPU]         [GB/s]
                                        UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.702         22.650          0.76%         22.621         23.216  [CPU]         [GB/s]
                              UnmapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)         22.023         22.016          0.16%         21.966         22.077  [CPU]         [GB/s]
                                         UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)     156986.754     159941.847          8.95%     130055.938     173744.632  [CPU]         [GB/s]
                                        UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         23.367         23.375          0.15%         23.291         23.403  [CPU]         [GB/s]
                              UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)         23.272         23.272          0.03%         23.260         23.283  [CPU]         [GB/s]
                                         UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)     125962.555     129005.938         16.88%      87154.369     159214.387  [CPU]         [GB/s]
                                        UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         23.274         23.272          0.04%         23.264         23.287  [CPU]         [GB/s]
                              UnmapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)         23.169         23.170          0.06%         23.146         23.199  [CPU]         [GB/s]
                    UsmCopy(api=ocl src=Device dst=Device size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)        225.379        225.134          0.47%        224.280        227.638  [GPU]         [GB/s]
                      UsmCopy(api=ocl src=Device dst=Host size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         22.460         22.467          0.05%         22.434         22.467  [GPU]         [GB/s]
                   UsmCopy(api=ocl src=Device dst=non-USM size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)          9.115          9.115          0.01%          9.114          9.116  [GPU]         [GB/s]
                      UsmCopy(api=ocl src=Host dst=Device size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         23.602         23.602          0.02%         23.593         23.610  [GPU]         [GB/s]
                        UsmCopy(api=ocl src=Host dst=Host size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         15.018         15.017          0.02%         15.015         15.025  [GPU]         [GB/s]
                     UsmCopy(api=ocl src=Host dst=non-USM size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)          8.264          8.264          0.04%          8.259          8.268  [GPU]         [GB/s]
                    UsmCopy(api=ocl src=Shared dst=Device size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)        227.263        227.553          0.51%        225.616        229.126  [GPU]         [GB/s]
                      UsmCopy(api=ocl src=Shared dst=Host size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         22.467         22.467          0.00%         22.466         22.468  [GPU]         [GB/s]
                   UsmCopy(api=ocl src=Shared dst=non-USM size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)          9.113          9.113          0.02%          9.109          9.114  [GPU]         [GB/s]
                   UsmCopy(api=ocl src=non-USM dst=Device size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         22.838         22.836          0.08%         22.809         22.877  [GPU]         [GB/s]
                     UsmCopy(api=ocl src=non-USM dst=Host size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         14.787         14.786          0.03%         14.778         14.796  [GPU]         [GB/s]
                  UsmCopy(api=ocl src=non-USM dst=non-USM size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)          8.196          8.195          0.05%          8.190          8.204  [GPU]         [GB/s]
            UsmCopy(api=ocl src=non-USM-mapped dst=Device size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         23.599         23.601          0.04%         23.573         23.607  [GPU]         [GB/s]
              UsmCopy(api=ocl src=non-USM-mapped dst=Host size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)         15.013         15.012          0.02%         15.009         15.020  [GPU]         [GB/s]
           UsmCopy(api=ocl src=non-USM-mapped dst=non-USM size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0)          8.289          8.288          0.05%          8.284          8.294  [GPU]         [GB/s]
                                             UsmCopyMultipleBlits(api=ocl src=Device dst=Device size=512MB blitters=000000001)        228.487        228.638          0.24%        227.503        229.442  [GPU]     BCS [GB/s]
                                                                                                                                      219.367        219.357          0.34%        217.917        220.639  [CPU] Total (Cpu) [GB/s]
                                                                                                                                      228.487        228.638          0.24%        227.503        229.442  [GPU] Total (Gpu) [GB/s]
                                               UsmCopyMultipleBlits(api=ocl src=Device dst=Host size=512MB blitters=000000001)         22.467         22.467          0.00%         22.467         22.468  [GPU]     BCS [GB/s]
                                                                                                                                       22.371         22.373          0.03%         22.360         22.380  [CPU] Total (Cpu) [GB/s]
                                                                                                                                       22.467         22.467          0.00%         22.467         22.468  [GPU] Total (Gpu) [GB/s]
                                               UsmCopyMultipleBlits(api=ocl src=Host dst=Device size=512MB blitters=000000001)         23.611         23.611          0.02%         23.606         23.618  [GPU]     BCS [GB/s]
                                                                                                                                       23.508         23.507          0.01%         23.500         23.512  [CPU] Total (Cpu) [GB/s]
                                                                                                                                       23.611         23.611          0.02%         23.606         23.618  [GPU] Total (Gpu) [GB/s]
                                                 UsmCopyMultipleBlits(api=ocl src=Host dst=Host size=512MB blitters=000000001)         15.016         15.016          0.02%         15.012         15.020  [GPU]     BCS [GB/s]
                                                                                                                                       14.974         14.974          0.02%         14.967         14.977  [CPU] Total (Cpu) [GB/s]
                                                                                                                                       15.016         15.016          0.02%         15.012         15.020  [GPU] Total (Gpu) [GB/s]
        UsmFill(api=ocl memory=Host size=128MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)         25.605         25.622          0.14%         25.527         25.629  [GPU]         [GB/s]
        UsmFill(api=ocl memory=Host size=128MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)         25.561         25.622          0.71%         25.015         25.632  [GPU]         [GB/s]
       UsmFill(api=ocl memory=Host size=128MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)         25.609         25.621          0.11%         25.550         25.634  [GPU]         [GB/s]
        UsmFill(api=ocl memory=Host size=512MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)         23.193         23.194          0.02%         23.184         23.198  [GPU]         [GB/s]
        UsmFill(api=ocl memory=Host size=512MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)         23.191         23.192          0.02%         23.184         23.197  [GPU]         [GB/s]
       UsmFill(api=ocl memory=Host size=512MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)         23.194         23.194          0.02%         23.183         23.197  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Device size=128MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)        482.868        483.759          0.56%        478.104        487.143  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Device size=128MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)        483.310        483.396          0.32%        480.421        486.407  [GPU]         [GB/s]
     UsmFill(api=ocl memory=Device size=128MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)        477.980        483.307          3.34%        430.214        484.395  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Device size=512MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)        470.968        472.213          0.83%        459.478        473.579  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Device size=512MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)        472.691        472.602          0.19%        471.026        473.971  [GPU]         [GB/s]
     UsmFill(api=ocl memory=Device size=512MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)        472.819        472.776          0.10%        471.672        473.362  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Shared size=128MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)        482.241        482.672          0.23%        480.243        483.667  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Shared size=128MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)        481.579        481.591          0.39%        479.171        484.032  [GPU]         [GB/s]
     UsmFill(api=ocl memory=Shared size=128MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)        482.369        482.220          0.30%        480.602        485.673  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Shared size=512MB contents=Zeros patternSize=1 patternContents=Random forceBlitter=0 useEvents=1)        471.871        472.213          0.43%        466.127        473.797  [GPU]         [GB/s]
      UsmFill(api=ocl memory=Shared size=512MB contents=Zeros patternSize=4 patternContents=Random forceBlitter=0 useEvents=1)        470.363        472.104          0.90%        459.560        473.492  [GPU]         [GB/s]
     UsmFill(api=ocl memory=Shared size=512MB contents=Zeros patternSize=16 patternContents=Random forceBlitter=0 useEvents=1)        471.894        472.277          0.46%        465.579        473.579  [GPU]         [GB/s]
             UsmFillSpecificPattern(api=ocl memory=Host size=128MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)         25.599         25.612          0.19%         25.459         25.630  [GPU]         [GB/s]
             UsmFillSpecificPattern(api=ocl memory=Host size=512MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)         23.193         23.193          0.02%         23.185         23.197  [GPU]         [GB/s]
           UsmFillSpecificPattern(api=ocl memory=Device size=128MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)        483.217        483.577          0.21%        481.140        484.759  [GPU]         [GB/s]
           UsmFillSpecificPattern(api=ocl memory=Device size=512MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)        470.693        471.931          0.89%        459.028        475.107  [GPU]         [GB/s]
           UsmFillSpecificPattern(api=ocl memory=Shared size=128MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)        482.011        482.130          0.38%        478.104        484.214  [GPU]         [GB/s]
           UsmFillSpecificPattern(api=ocl memory=Shared size=512MB contents=Zeros pattern=0x01AA0B forceBlitter=0 useEvents=1)        472.682        472.927          0.20%        471.112        474.364  [GPU]         [GB/s]
                                           UsmMemset(api=ocl memory=Host size=128MB contents=Zeros forceBlitter=0 useEvents=0)         21.151         21.022          1.54%         20.823         22.029  [CPU]         [GB/s]
                                           UsmMemset(api=ocl memory=Host size=512MB contents=Zeros forceBlitter=0 useEvents=0)         22.068         22.092          0.37%         21.921         22.165  [CPU]         [GB/s]
                                         UsmMemset(api=ocl memory=Device size=128MB contents=Zeros forceBlitter=0 useEvents=0)        374.750        376.363          1.78%        356.238        382.403  [CPU]         [GB/s]
                                         UsmMemset(api=ocl memory=Device size=512MB contents=Zeros forceBlitter=0 useEvents=0)        423.066        422.353          0.57%        419.965        428.657  [CPU]         [GB/s]
                                         UsmMemset(api=ocl memory=Shared size=128MB contents=Zeros forceBlitter=0 useEvents=0)        365.479        373.894          4.37%        330.877        376.148  [CPU]         [GB/s]
                                         UsmMemset(api=ocl memory=Shared size=512MB contents=Zeros forceBlitter=0 useEvents=0)        395.557        400.138          4.68%        367.090        418.676  [CPU]         [GB/s]
                                                                      UsmSharedMigrateCpu(api=ocl accessAllBytes=0 size=128MB)         21.457         21.538          0.73%         21.240         21.613  [CPU]         [GB/s]
                                                                      UsmSharedMigrateCpu(api=ocl accessAllBytes=0 size=256MB)         21.684         21.694          0.10%         21.629         21.696  [CPU]         [GB/s]
                                                                      UsmSharedMigrateCpu(api=ocl accessAllBytes=1 size=128MB)          3.247          3.274          4.64%          3.046          3.436  [CPU]         [GB/s]
                                                                      UsmSharedMigrateCpu(api=ocl accessAllBytes=1 size=256MB)          3.409          3.274         10.72%          3.086          4.330  [CPU]         [GB/s]
                                                                            UsmSharedMigrateGpu(api=ocl size=128MB prefetch=0)         19.570         19.691          2.05%         18.970         20.014  [CPU]         [GB/s]
                                                                            UsmSharedMigrateGpu(api=ocl size=128MB prefetch=1)         19.206         19.232          1.81%         18.677         19.689  [CPU]         [GB/s]
                                                                            UsmSharedMigrateGpu(api=ocl size=256MB prefetch=0)         17.644         18.156          7.07%         15.055         18.956  [CPU]         [GB/s]
                                                                            UsmSharedMigrateGpu(api=ocl size=256MB prefetch=1)         16.080         15.733          6.10%         15.662         19.015  [CPU]         [GB/s]
                                                      UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=0 forceBlitter=0)         17.633         17.466          2.34%         17.096         18.486  [CPU]         [GB/s]
                                                      UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=0 forceBlitter=1)                                        ERROR
                                                      UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=1 forceBlitter=0)         17.184         17.682          5.81%         15.047         18.045  [CPU]         [GB/s]
                                                      UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=1 forceBlitter=1)                                        ERROR
                                                      UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=0 forceBlitter=0)         20.771         20.856          0.70%         20.582         20.918  [CPU]         [GB/s]
                                                      UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=0 forceBlitter=1)                                        ERROR
                                                      UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=1 forceBlitter=0)         20.809         20.856          0.57%         20.570         20.921  [CPU]         [GB/s]
                                                      UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=1 forceBlitter=1)                                        ERROR
                                            WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=None)         21.459         21.456          0.32%         21.293         21.549  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         23.372         23.504          1.68%         22.197         23.537  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         23.370         23.509          1.59%         22.257         23.521  [CPU]         [GB/s]
                                            WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=None)         22.690         22.680          0.19%         22.617         22.771  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=Usm)         23.520         23.558          0.45%         23.202         23.568  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=1 reuse=Map)         23.507         23.542          0.45%         23.191         23.551  [GPU]         [GB/s]
                                            WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=None)         21.512         21.520          0.21%         21.437         21.568  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=Usm)         23.326         23.459          1.57%         22.232         23.475  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=0 reuse=Map)         23.270         23.409          1.65%         22.124         23.437  [CPU]         [GB/s]
                                            WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=None)         22.690         22.681          0.12%         22.644         22.746  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=Usm)         23.396         23.429          0.45%         23.081         23.453  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=128MB contents=Zeros compressed=1 useEvents=1 reuse=Map)         23.401         23.435          0.48%         23.065         23.457  [GPU]         [GB/s]
                                            WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=None)         21.807         21.822          0.36%         21.591         21.874  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         23.578         23.599          0.25%         23.406         23.609  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         23.572         23.595          0.25%         23.400         23.598  [CPU]         [GB/s]
                                            WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=None)         22.841         22.843          0.08%         22.802         22.869  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=Usm)         23.603         23.611          0.09%         23.539         23.614  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=0 useEvents=1 reuse=Map)         23.595         23.603          0.10%         23.526         23.605  [GPU]         [GB/s]
                                            WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=None)         21.847         21.860          0.26%         21.693         21.903  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=Usm)         23.513         23.533          0.25%         23.339         23.547  [CPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=0 reuse=Map)         23.439         23.470          0.39%         23.164         23.481  [CPU]         [GB/s]
                                            WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=None)         22.858         22.858          0.04%         22.840         22.870  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=Usm)         23.567         23.574          0.09%         23.501         23.580  [GPU]         [GB/s]
                                             WriteBuffer(api=ocl size=512MB contents=Zeros compressed=1 useEvents=1 reuse=Map)         23.489         23.496          0.10%         23.423         23.505  [GPU]         [GB/s]
                          WriteBufferRect(api=ocl size=16KB compressed=0 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.258          0.260          6.25%          0.223          0.284  [CPU]         [GB/s]
                         WriteBufferRect(api=ocl size=2MB compressed=0 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)          2.044          2.038          3.73%          1.951          2.164  [CPU]         [GB/s]
                    WriteBufferRect(api=ocl size=128MB compressed=0 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)          2.575          2.575          0.03%          2.573          2.576  [CPU]         [GB/s]
                       WriteBufferRect(api=ocl size=16MB compressed=0 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)          5.573          5.570          0.70%          5.504          5.622  [CPU]         [GB/s]
                        WriteBufferRect(api=ocl size=16MB compressed=0 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         15.915         15.908          5.99%         14.716         17.187  [CPU]         [GB/s]
                          WriteBufferRect(api=ocl size=16KB compressed=1 origin=0:0:0 region=128:128:1 rPitch=128 sPitch=16KB)          0.260          0.261          2.48%          0.246          0.271  [CPU]         [GB/s]
                         WriteBufferRect(api=ocl size=2MB compressed=1 origin=0:0:0 region=128:128:128 rPitch=128 sPitch=16KB)          2.006          2.047          6.30%          1.849          2.161  [CPU]         [GB/s]
                    WriteBufferRect(api=ocl size=128MB compressed=1 origin=0:0:0 region=128:1024:1024 rPitch=128 sPitch=128KB)          2.576          2.577          0.10%          2.569          2.579  [CPU]         [GB/s]
                       WriteBufferRect(api=ocl size=16MB compressed=1 origin=0:0:0 region=1024:16:1024 rPitch=1KB sPitch=16KB)          5.552          5.558          0.39%          5.517          5.578  [CPU]         [GB/s]
                        WriteBufferRect(api=ocl size=16MB compressed=1 origin=0:0:0 region=1024:1024:16 rPitch=1KB sPitch=1MB)         15.639         15.638          6.42%         14.362         17.157  [CPU]         [GB/s]

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/1
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /workspace/neo/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/3
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /workspace/neo/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/5
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /workspace/neo/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48

[  FAILED  ] UsmSharedMigrateGpuForFillTest/UsmSharedMigrateGpuForFillTest.Test/7
FAILED assertion ASSERT_CL_SUCCESS(clEnqueueMemFillINTEL(opencl.commandQueue, buffer, &pattern, 1, arguments.bufferSize, 0, nullptr, nullptr))
	value: -59 (CL_INVALID_OPERATION)
	Location: /workspace/neo/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/ocl/usm_shared_migrate_gpu_for_fill_ocl.cpp:48
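
A quick way to see what the four failures have in common is to compare the parameters of the `ERROR` rows against the rows that produced a result. The snippet below is only a sketch: the rows are copied from the table above with whitespace collapsed, and `parse_row` and `distinguishing_params` are hypothetical helper names, not part of compute-benchmarks.

```python
import re

# UsmSharedMigrateGpuForFill rows copied from the log above (whitespace collapsed).
ROWS = """
UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=0 forceBlitter=0) 17.633 17.466 2.34% 17.096 18.486 [CPU] [GB/s]
UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=0 forceBlitter=1) ERROR
UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=1 forceBlitter=0) 17.184 17.682 5.81% 15.047 18.045 [CPU] [GB/s]
UsmSharedMigrateGpuForFill(api=ocl size=128MB prefetch=1 forceBlitter=1) ERROR
UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=0 forceBlitter=0) 20.771 20.856 0.70% 20.582 20.918 [CPU] [GB/s]
UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=0 forceBlitter=1) ERROR
UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=1 forceBlitter=0) 20.809 20.856 0.57% 20.570 20.921 [CPU] [GB/s]
UsmSharedMigrateGpuForFill(api=ocl size=256MB prefetch=1 forceBlitter=1) ERROR
""".strip().splitlines()

# "Name(key=value key=value ...)" followed by either the statistics or ERROR.
ROW_RE = re.compile(r"^\s*(?P<name>\w+)\((?P<params>[^)]*)\)\s+(?P<rest>.*)$")


def parse_row(line):
    """Return (test name, parameter dict, is_error) or None for non-matching lines."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    params = dict(item.split("=", 1) for item in match.group("params").split())
    return match.group("name"), params, match.group("rest").strip() == "ERROR"


def distinguishing_params(rows):
    """Parameters whose values in failing rows never occur in passing rows."""
    failed, passed = [], []
    for line in rows:
        parsed = parse_row(line)
        if parsed is None:
            continue
        _, params, is_error = parsed
        (failed if is_error else passed).append(params)

    result = {}
    keys = set().union(*failed) if failed else set()
    for key in sorted(keys):
        bad = {params.get(key) for params in failed}
        good = {params.get(key) for params in passed}
        if bad.isdisjoint(good):
            result[key] = sorted(bad)
    return result


print(distinguishing_params(ROWS))
```

For these eight rows the only parameter that separates failures from successes is `forceBlitter=1`, which matches the failing test indices 1, 3, 5 and 7. Why `clEnqueueMemFillINTEL` returns `CL_INVALID_OPERATION` in that case is not something the log itself explains.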

@al42and
Contributor

al42and commented Mar 3, 2023

I'm also observing some failures on the 6.2.1 kernel with an A770; logs attached.

The performance drop for non-D2D copies is even worse than reported by others. Here's an excerpt from memory_benchmark_l0.txt for UsmCopy(api=l0 src=* dst=* size=512MB contents=Zeros forceBlitter=0 useEvents=1 reuseCmdList=0):

                      TestCase       Mean     Median   StdDev      Min        Max   Type   Label [unit]
UsmCopy(src=Device dst=Device)    225.207    224.840    0.34%  224.298    226.493  [GPU]         [GB/s]
  UsmCopy(src=Device dst=Host)      2.845      2.848    0.28%    2.822      2.849  [GPU]         [GB/s]
  UsmCopy(src=Host dst=Device)      2.920      2.921    0.05%    2.916      2.922  [GPU]         [GB/s]
    UsmCopy(src=Host dst=Host)      2.071      2.071    0.03%    2.070      2.072  [GPU]         [GB/s]
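
To put a number on the drop, the Level Zero means above can be set against the OpenCL `UsmCopy` means for the same 512MB case from the log earlier in this thread. This is a rough sketch only: the two logs may come from different machines and driver versions, so the ratios are indicative rather than a controlled comparison, and `slowdown` is a hypothetical helper name.

```python
# Mean GB/s for UsmCopy(size=512MB ... forceBlitter=0 useEvents=1 reuseCmdList=0),
# copied from the two tables in this thread. Keys are (src, dst).
OCL_MEAN = {
    ("Device", "Device"): 225.379,
    ("Device", "Host"): 22.460,
    ("Host", "Device"): 23.602,
    ("Host", "Host"): 15.018,
}
L0_MEAN = {
    ("Device", "Device"): 225.207,
    ("Device", "Host"): 2.845,
    ("Host", "Device"): 2.920,
    ("Host", "Host"): 2.071,
}


def slowdown(ocl, l0):
    """Ratio of OpenCL to Level Zero bandwidth for every path present in both."""
    return {path: ocl[path] / l0[path] for path in ocl if path in l0}


for (src, dst), ratio in slowdown(OCL_MEAN, L0_MEAN).items():
    print(f"{src:>6} -> {dst:<6}  OpenCL / L0 = {ratio:5.2f}x")
```

With these figures the device-to-device path is essentially identical between the two APIs, while every path that touches host memory is roughly 7 to 8 times slower in the Level Zero run.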

@BA8F0D39
Author

BA8F0D39 commented Mar 9, 2023

@BartusW
Did you manage to replicate the problem on your end, or do you need more information?

@JablonskiMateusz
Contributor

With the recent release, it works fine on our side. @BA8F0D39, could you confirm that it is working for you?

@JablonskiMateusz
Contributor

@BA8F0D39 Please re-open the issue or create a new one if the problem still exists.
