Too low 4GB allocation limit on Intel Arc GPUs (CL_DEVICE_MAX_MEM_ALLOC_SIZE) #627

Closed
ProjectPhysX opened this issue Mar 14, 2023 · 19 comments
Labels: merged (change was merged), question

@ProjectPhysX

ProjectPhysX commented Mar 14, 2023

The CL_DEVICE_MAX_MEM_ALLOC_SIZE on Intel Arc GPUs is currently set to 4GB (A770 16GB) and 3.86GB (A750). Trying to allocate larger buffers makes the cl::Buffer constructor return error -61 (CL_INVALID_BUFFER_SIZE). Disabling the error by setting buffer flag (1<<23) during allocation turns compute results into nonsense when the buffer size is larger than 4GB.

This is likely related to 32-bit integer overflow in index calculation.

A 4GB limit on buffer allocation is not contemporary in 2023, especially on a 16GB GPU; it's not 2003 anymore, when computers were limited to 32-bit addressing. A lot of software needs to be able to allocate larger buffers in order to fully use the available VRAM capacity. FluidX3D, for example, needs up to 82% of VRAM in a single buffer; if the allocation limit is 25% on the A770 16GB, only 4.9GB of the 16GB can be used by the software.

The limit should be removed altogether by setting CL_DEVICE_MAX_MEM_ALLOC_SIZE = CL_DEVICE_GLOBAL_MEM_SIZE = 100% of physical VRAM capacity, and by making sure that array indices are computed with 64-bit integers. Nvidia and AMD have both allowed full-VRAM allocation in a single buffer for a long time.
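
For illustration, a minimal kernel sketch (hypothetical, not FluidX3D's actual code) of why 64-bit index math matters here: even when the element index fits in 32 bits, the byte offset into a >4GB buffer does not, so 32-bit address arithmetic silently wraps around:

// hypothetical OpenCL C kernel, embedded as a C++ raw string
const char* kernel_source = R"(
kernel void scale(global float* data, const ulong n) {
    const ulong i = get_global_id(0); // keep the index 64-bit
    // the byte offset is i*sizeof(float); beyond i ~ 1.07 billion this
    // exceeds 2^32-1, so 32-bit offset arithmetic would wrap and access
    // the wrong memory - consistent with the nonsense results above
    if(i < n) data[i] = 2.0f*data[i];
}
)";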

@jinz2014

Which Linux kernel is installed?

@ProjectPhysX
Author

@jinz2014 kernel 6.2.0-060200-generic, on Ubuntu 22.04.

@jinz2014

#617
I think the developers are aware of the issue. I look forward to a solution in the future.

@BA8F0D39

BA8F0D39 commented Mar 15, 2023

I'm not an Intel employee, but I have found many bugs related to memory transfers:

  1. The Linux kernel driver causes random memory corruption even if you allocate less than 4GB on the A770.
  2. The blocking memory transfer functions in the compute runtime unblock prematurely and return invalid data because the memory transfer hasn't completed yet (see the sketch after this list).
  3. Having more than one independent memory transfer in flight sometimes causes memory corruption, e.g. copying A to B at the same time as copying C to D causes corruption.
  4. Reading VRAM is 2x slower than writing data into VRAM. No other GPU behaves this way.
  5. Segmentation fault if you transfer GPU data from one CPU thread to another CPU thread.
  6. The compute runtime prevents you from allocating all of the VRAM, while Vulkan and OpenGL allow you to allocate all of it for some reason. Gaming under Linux is unaffected by most of these bugs.
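
For item 2, a minimal host-side check one could run (a hypothetical sketch; queue, buffer, size and pattern are assumed, with the buffer pre-filled on the device with the known pattern):

#include <vector>

// hypothetical verification sketch: a blocking read must not return
// before the transferred data is complete
std::vector<unsigned char> host(size, 0);
queue.enqueueReadBuffer(buffer, CL_TRUE, 0, size, host.data()); // CL_TRUE = blocking read
for(size_t i = 0u; i < size; i++) {
    if(host[i] != pattern) { /* blocking call returned before the transfer finished */ }
}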

Should we file a bug at https://bugzilla.kernel.org?

@jinz2014

@BA8F0D39 Thank you for your summary. Are there links/reproducers available for each item in your list?

@BA8F0D39

BA8F0D39 commented Mar 21, 2023

@jinz2014

  1. If you run the compute benchmarks (https://github.com/intel/compute-benchmarks), the memory transfer benchmarks report infinite memory bandwidth, which is impossible. Therefore, some of the blocking memory functions prematurely unblock without completing their transfer first, causing the benchmarks to show infinite bandwidth.
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.151         22.163          0.11%         22.101         22.165  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf    2581110.154            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.150         22.162          0.12%         22.097         22.165  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.143         22.162          0.13%         22.093         22.163  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=128MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf            inf            inf    2581110.154            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Read useEvents=1)         22.400         22.403          0.03%         22.388         22.404  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=Write useEvents=1)         22.402         22.403          0.02%         22.388         22.404  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=0 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                           MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Read useEvents=1)         22.401         22.402          0.02%         22.387         22.403  [GPU]         [GB/s]
                                          MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=Write useEvents=1)         22.401         22.403          0.02%         22.387         22.403  [GPU]         [GB/s]
                                MapBuffer(api=ocl size=512MB contents=Zeros compressed=1 mapFlags=WriteInvalidate useEvents=1)            inf   10324440.615            inf   10324440.615            inf  [GPU]         [GB/s]
                                             ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=None)          8.908          8.911          0.08%          8.893          8.915  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Usm)         21.459         21.351          0.84%         21.281         21.703  [CPU]         [GB/s]
                                              ReadBuffer(api=ocl size=128MB contents=Zeros compressed=0 useEvents=0 reuse=Map)         21.427         21.422          1.17%         21.101         21.722  [CPU] 
  2. Having multiple memory transfers at the same time in the compute benchmarks also corrupts memory:
[  FAILED  ] StreamMemoryTest/StreamMemoryTest.Test/49
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/stream_memory_l0.cpp:173

(identical ZE_RESULT_NOT_READY failures repeat for StreamMemoryTest.Test/51, /53, ..., /95)

[  FAILED  ] UsmCopyTest/UsmCopyTest.Test/72
FAILED assertion ASSERT_ZE_RESULT_SUCCESS(zeEventQueryKernelTimestamp(event, &timestampResult))
	value: 1 (ZE_RESULT_NOT_READY)
	Location: /opt/test/compute-benchmarks/compute-benchmarks/source/benchmarks/memory_benchmark/implementations/l0/usm_copy_l0.cpp:101

(identical ZE_RESULT_NOT_READY failures repeat for UsmCopyTest.Test/76, /80, ..., /128)
  3. clpeak shows that reading from VRAM is 2x slower than writing to it. No other GPU behaves this way:
    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 21.64
      enqueueReadBuffer               : 8.92
      enqueueWriteBuffer non-blocking : 22.81
      enqueueReadBuffer non-blocking  : 9.10
      enqueueMapBuffer(for read)      : 20.58
        memcpy from mapped ptr        : 22.62
      enqueueUnmap(after write)       : 23.62
        memcpy to mapped ptr          : 22.44
  4. If you run the compute-benchmarks multithread tests, they all fail if they use more than one CPU thread:
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=1 workgroupsPerProcess=300 synchronize=1)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=2 workgroupsPerProcess=300 synchronize=1)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=1 synchronize=0)                                        ERROR
                            MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=1 synchronize=1)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=300 synchronize=0)                                        ERROR
                          MultiProcessComputeSharedBuffer(api=l0 tiles=Tile0 processesPerTile=4 workgroupsPerProcess=300 synchronize=1)                                        ERROR

@MichalMrozek
Contributor

  1. If you run the compute benchmarks (https://github.com/intel/compute-benchmarks), the memory transfer benchmarks report infinite memory bandwidth, which is impossible. Therefore, some of the blocking memory functions prematurely unblock without completing their transfer first, causing the benchmarks to show infinite bandwidth.

Those MapBuffer tests have the WriteInvalidate flag, which means the driver doesn't need to do a memory transfer at all, as the host will be overwriting the contents. That's why the reported value is so high: there is no transfer, so the operation takes almost no time. This shows that the driver properly optimizes those API calls.
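
For reference, a minimal host-side sketch of the difference between the two map flags (hypothetical queue, buffer and size; the flags themselves are standard OpenCL 1.2):

cl_int error = CL_SUCCESS;
// CL_MAP_WRITE: the existing buffer contents must be made visible to the
// host, so a real device-to-host transfer happens - finite, measurable bandwidth
void* p_write = queue.enqueueMapBuffer(buffer, CL_TRUE, CL_MAP_WRITE, 0, size, nullptr, nullptr, &error);
// CL_MAP_WRITE_INVALIDATE_REGION: the contents are declared invalid and the
// host promises to overwrite them, so the driver can skip the transfer
// entirely - hence the "infinite" bandwidth in the output above
void* p_inval = queue.enqueueMapBuffer(buffer, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0, size, nullptr, nullptr, &error);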

@FreddieWitherden

The limit should be removed altogether by setting CL_DEVICE_MAX_MEM_ALLOC_SIZE = CL_DEVICE_GLOBAL_MEM_SIZE = 100% of physical VRAM capacity, and by making sure that array indices are computed with 64-bit integers. Nvidia and AMD have both allowed full-VRAM allocation in a single buffer for a long time.

I am unsure how practical this is. A lot of compiler optimizations are based on the buffer size being limited to 4 GiB. From what I can gather from the ISA, a lot of the memory instructions only support base + 32-bit-offset address calculations. If larger allocations are permitted, the compiler can no longer emit these instructions and has to fall back to those which accept 64-bit addresses. This requires extra registers in the kernel and emulated 64-bit arithmetic.

@iamhumanipromise

I have a laptop with a 9th Gen Core i9 (with 9th Gen graphics) and 32GB RAM. Random Discord conversations have talked about using it for Stable Diffusion. That being said, the 4GB limit seems to apply there as well.

So is this some sort of carryover from the GFX8/GFX9 generations? Does the allocation limit change when using a dedicated GPU vs. an iGPU?

@MaciejPlewka
Contributor

MaciejPlewka commented Apr 13, 2023

It's possible to use allocations greater than 4GB. Please take a look at this guide https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

@iamhumanipromise

iamhumanipromise commented Apr 14, 2023

That's the programmer's guide. I'm specifically talking about what happens after the Python environment has been launched, when I'm executing already-generated code without this flag enabled.

Looks like I have to file an issue with that project in this case. Since it has been closed, is there no way to make this work without the dev modifying it (or forking, modifying, etc.)?

@ProjectPhysX
Author

ProjectPhysX commented Apr 15, 2023

It's possible to use allocations greater than 4GB. Please take a look at this guide https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

I just tried it again with my A750 and the latest driver 22.49.25018.23 and kernel 6.2.11-060211-generic on Ubuntu 22.04.2 LTS, and it is still broken. Passing CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = (1 << 23) to the cl::Buffer constructor / clCreateBuffer suppresses the buffer allocation error -61, but simulation results then become nonsense.

To reproduce:

git clone https://github.com/ProjectPhysX/FluidX3D.git
cd FluidX3D
chmod +x make.sh

Change src/opencl.hpp line 213 to

device_buffer = cl::Buffer(device.get_cl_context(), CL_MEM_READ_WRITE|(1<<23), capacity(), nullptr, &error); // (1<<23) = CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL

Set the benchmark grid resolution in src/setup.cpp from 256³ to 384³ by commenting line 1137 and uncommenting line 1138.

Compile and run with:

./make.sh

If it's broken, it will show impossibly large performance/bandwidth numbers, like:

| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   28144 |   4306 GB/s |       497 |         6281  10% |                  0s |

If it works, reported bandwidth will be realistic, like:

| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2500 |    382 GB/s |       149 |         9994  40% |                  0s |

@bashbaug
Contributor

Hi @ProjectPhysX, please note that you fundamentally need two changes to make >4GB allocations "work":

  1. You need to relax the allocation limits using CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL. It looks like you are doing this in the steps above, and it seems to be working, because you are no longer seeing the CL_INVALID_BUFFER_SIZE allocation error.
  2. You need to tell the compiler that you are using >4GB allocations by passing the -cl-intel-greater-than-4GB-buffer-required program build option. This isn't mentioned in the steps above, so I'm guessing it's missing and is what is causing the nonsense simulation results. Note that this isn't free, which is why the option is not enabled by default.

If you want to play around with >4GB allocations without hacking around in your code, please consider trying the OpenCL Intercept Layer, specifically the RelaxAllocationLimits control, which will automatically do both of these steps for you.
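
Putting the two steps together, a minimal sketch (hypothetical context and kernel_source; the flag value (1<<23) and the build option are the ones quoted in this thread, and older headers may not define the flag):

#ifndef CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL
#define CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL (1 << 23) // value used earlier in this thread
#endif

// step 1: relax the allocation limit for this buffer
cl_int error = CL_SUCCESS;
cl::Buffer big_buffer(context, CL_MEM_READ_WRITE|CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL,
                      6ull*1024ull*1024ull*1024ull, nullptr, &error); // e.g. 6GB

// step 2: tell the compiler to emit 64-bit address arithmetic
cl::Program program(context, kernel_source);
error = program.build("-cl-intel-greater-than-4GB-buffer-required");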

@ProjectPhysX
Author

  2. You need to tell the compiler that you are using >4GB allocations by passing the -cl-intel-greater-than-4GB-buffer-required program build option. This isn't mentioned in the steps above, so I'm guessing it's missing and is what is causing the nonsense simulation results. Note that this isn't free, which is why the option is not enabled by default.

Thanks, it works! I totally missed that build flag. It would be better to enable >4GB allocations by default, though.

@ElliottDyson

ElliottDyson commented Jan 12, 2024

Any updates on enabling this by default?

I understand there seem to be some performance compromises when this is done. Just out of curiosity, are these being worked on?

If this has been found not to be possible, could a method be implemented so that, instead of erroring, the CPU sends over 4GB chunks at a time followed by the remaining amount, so that more than 4GB can still be sent in "one go"? This would help PyTorch and any other application that needs more than 4GB of allocation.
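
An application-level version of this chunking idea is straightforward for the transfer itself (a hypothetical sketch; queue, buffer, host_ptr and total are assumed). Note that this only splits the copy; the destination buffer still has to be allocatable in one piece:

// hypothetical sketch: copy 'total' bytes from 'host_ptr' into 'buffer'
// in 256MB pieces using buffer offsets
const size_t chunk = 256ull*1024ull*1024ull;
for(size_t offset = 0ull; offset < total; offset += chunk) {
    const size_t bytes = (total-offset < chunk) ? total-offset : chunk;
    queue.enqueueWriteBuffer(buffer, CL_TRUE, offset, bytes, (const char*)host_ptr+offset);
}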

I understand that custom compiling of, say, PyTorch with custom compiler flags is something that can be done, as you have suggested. However, some of us aren't developers, which makes that quite a hurdle.

Thank you

@dbenedb

dbenedb commented Jan 21, 2024

I'm seconding @ElliottDyson's request. The 4GB allocation limit makes Intel GPUs like the Arc A770 useless for any Stable Diffusion work.

@ElliottDyson

I'm seconding @ElliottDyson's request. The 4GB allocation limit makes Intel GPUs like the Arc A770 useless for any Stable Diffusion work.

I'm not sure how visible this thread is since it's been closed. Perhaps we should open a new one that references this? I'm not sure what typical GitHub etiquette is for something like this, though, which is why I haven't done so yet.

@lorenzeszz

With Intel UHD Graphics it is only 1.86GB on a system with 32GB of RAM. The OpenVINO GPU plugin is useless for me.

@simonlui

simonlui commented Oct 25, 2024

I'm going to add another use case that has hit this wall and indisputably needs a larger buffer size than the enforced 4GB limit. Video diffusion has now started to hit its stride, with bigger models like Mochi coming out, but even when quantized to 8-bit, I cannot generate more than 7 frames with ComfyUI and a corresponding wrapper plugin, because the generated data overflows some memory limit somewhere and I get either a PI_ERROR_OUT_OF_HOST_MEMORY or an UNKNOWN PI error when trying to run settings that a 4060 Ti with 16GB of VRAM can handle.
