Too low 4GB allocation limit on Intel Arc GPUs (CL_DEVICE_MAX_MEM_ALLOC_SIZE) #627
Comments
What is the Linux kernel installed?
@jinz2014 kernel 6.2.0-060200-generic, on Ubuntu 22.04.
#617
Not an Intel employee, but I found many bugs related to memory transfer.
Should we file a bug at https://bugzilla.kernel.org ?
@BA8F0D39 Thank you for your summary. Are there links/reproducers available for each item in your list?
Those MapBuffer tests have the WriteInvalidate flag, which means they don't need to do a memory transfer at all, as the host will be overwriting the contents. That's why the reported value is high: since there is no transfer, the call is very short in time. This shows that the driver properly optimizes those API calls.
I am unsure how practical this is. A lot of optimizations in the compiler are based around the buffer size being limited to 4 GiB. From what I can gather from the ISA, a lot of the memory instructions only support 32-bit offsets.
I have a laptop with a 9th Gen Core i9 (9th Gen graphics as well) and 32GB RAM. Random Discord convos have talked about using it for Stable Diffusion. That being said, the 4GB limit seems to apply there as well. So is this some sort of carryover from GFX8/GFX9 cards? Does the allocation size change when using a dedicated GPU vs. the iGPU?
It's possible to use allocations greater than 4GB. Please take a look at this guide https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md |
That's the programmer's guide. I'm specifically talking about after the Python environment has been launched and I'm executing already-generated code without this flag enabled. Looks like I have to file an issue with that project. In this case, since it has been closed, there is no way to make this work without the dev modifying it? (Or forking, modifying, etc.)
I just tried it again with my A750 and the latest driver. To reproduce:

1. Change
2. Set the benchmark grid resolution in
3. Compile and run with:

If it's broken, it will show impossibly large performance/bandwidth, like
If it works, the reported bandwidth will be realistic, like:
Hi @ProjectPhysX, please note that you fundamentally need two changes to make >4GB allocations "work":
If you want to play around with >4GB allocations without hacking around in your code, please consider trying the OpenCL Intercept Layer, specifically its RelaxAllocationLimits control, which will automatically do both of these steps for you.
Thanks, it works! I totally missed that build flag. Would be better to enable above 4GB allocations by default though. |
Any updates on enabling this by default? I understand there seem to be some performance compromises if this is done; just out of curiosity, are these being worked on?

If this has been found to not be possible, could a method be implemented so that, instead of erroring, the CPU sends over 4GB chunks at a time, then the remaining amount, so that greater than 4GB can still be sent in "one go"? Whether it be for PyTorch or any other similar application that needs to make use of greater than 4GB of allocation.

I understand that custom compiling of, say, PyTorch is something that can be done, as you have suggested, with custom compiler flags. However, some of us aren't developers, which makes that quite a hurdle. Thank you
I'm seconding @ElliottDyson request. 4GB allocation limit makes Intel's GPUs like ARC A770 useless for any Stable Diffusion work. |
I'm not sure how visible this thread is since it's been closed. Perhaps we should be opening a new one that references this? Not sure what typical GitHub etiquette is about something like this though, which is why I haven't done so yet. |
With Intel UHD it is only 1.86GB on a 32GB RAM system. OpenVINO GPU Plugin is useless for me. |
I'm going to add another use case where I have hit the wall, one that indisputably needs a larger buffer size than this enforced 4GB limit. Video diffusion has now started to hit its stride, with bigger models coming out like Mochi, but even when quantized to 8-bit, I cannot generate more than 7 frames with ComfyUI and a corresponding wrapper plugin, because the generated data is overflowing some memory somewhere and I either get a
The `CL_DEVICE_MAX_MEM_ALLOC_SIZE` on Intel Arc GPUs is currently set to 4GB (A770 16GB) and 3.86GB (A750). Trying to allocate larger buffers makes the `cl::Buffer` constructor return error `-61`. Disabling the error by setting buffer flag `(1<<23)` during allocation turns compute results into nonsense when the buffer size is larger than 4GB. This is likely related to 32-bit integer overflow in index calculation.
A 4GB limit on buffer allocation is not contemporary in 2023, especially on a 16GB GPU; it's not 2003 anymore, when computers were limited to 32-bit. A lot of software needs to be able to allocate larger buffers in order to fully use the available VRAM capacity. FluidX3D, for example, needs up to 82% of VRAM in a single buffer; with the allocation limit at 25% on the A770 16GB, only 4.9GB of the 16GB can be used by the software.
The limit should be removed altogether, by setting `CL_DEVICE_MAX_MEM_ALLOC_SIZE` = `CL_DEVICE_GLOBAL_MEM_SIZE` = 100% of physical VRAM capacity, and by making sure that array indices are computed with 64-bit integers. Nvidia and AMD have both allowed full-VRAM allocation in a single buffer for a long time already.