
Arrays larger than 4 GB crash #325

Open
BA8F0D39 opened this issue Apr 8, 2023 · 52 comments
Labels: ARC (ARC GPU), Crash (Execution crashes)

Comments

@BA8F0D39

BA8F0D39 commented Apr 8, 2023

Describe the bug

Intel compute runtime doesn't allow allocating a buffer bigger than 4 GB.

intel/compute-runtime#627

When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes.

x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')

Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
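
For illustration only: a workaround at the application level could look like the sketch below, which splits the matrix row-wise so that every chunk stays well under 4 GB (the chunk size and the mean computation are arbitrary choices for this example).

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)

rows, cols = 46000, 46000
chunk_rows = 10000  # 10000 x 46000 float32 is ~1.84 GB per chunk, well under 4 GB

chunks = [torch.rand(min(chunk_rows, rows - r), cols, dtype=torch.float32, device='xpu')
          for r in range(0, rows, chunk_rows)]

# Compute a global mean without ever materializing the full ~8.5 GB tensor.
total = sum(c.sum() for c in chunks)
count = sum(c.numel() for c in chunks)
print("Mean:", (total / count).item())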

Versions

Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          20
On-line CPU(s) list:             0-19
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name:                 13th Gen Intel(R) Core(TM) i5-13600K
CPU family:                      6
Model:                           183
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       1
Stepping:                        1
CPU max MHz:                     5100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6991.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       544 KiB (14 instances)
L1i cache:                       704 KiB (14 instances)
L2 cache:                        20 MiB (8 instances)
L3 cache:                        24 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A

@BA8F0D39 BA8F0D39 closed this as completed Apr 8, 2023
@BA8F0D39 BA8F0D39 reopened this Apr 14, 2023
@jingxu10 jingxu10 added the ARC (ARC GPU) and Crash (Execution crashes) labels Apr 16, 2023
@jingxu10
Contributor

@tye1

@BA8F0D39
Author

I did some further tests and it seems like allocating more than 4GB returns garbage or randomly crashes.

Example of allocating less than 4 GB on an A770 16GB. The mean is around 0.5, which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(30000, 30000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.50001085

Example of allocating more than 4 GB on the CPU. The mean is around 0.5, which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())



python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.4999941

Example of allocating more than 4 GB on an A770 16GB. The mean is around 0.014, which is completely wrong.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.014004011

In conclusion, allocating more than 4GB crashes or returns complete garbage.

@BA8F0D39
Author

@jingxu10
Is memory allocation done by OpenCL, Level Zero, or oneDNN?

@jingxu10
Contributor

It should be allocated by Level Zero.
@gujinghui

@BA8F0D39
Author

@jingxu10

Will passing -ze-opt-greater-than-4GB-buffer-required into the build options fix it?

https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options

@cchheennhhaaoo

cchheennhhaaoo commented Apr 27, 2023

Hi, @BA8F0D39
What's the driver version? I cannot reproduce the random crash with agama-ci-devel-602. From what I've tried, the maximum workable input shape for your test is about 59500*59500, which corresponds to a memory size of 13.2 GB. That is a reasonable result.
We will check the accuracy issue.

@zejun-chen
Contributor

Hi @BA8F0D39

Thank you for using Intel products and IPEX.
On our side we can successfully allocate large memory (not larger than the total physical memory size) and compute correctly.
Can you provide the driver version you are using via the command below?
sudo dpkg -l | grep intel

And is it possible to add the following flags and attach the log here when you find the error?

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

Thank you.

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo

On Windows 11 with WSL

ii  intel-level-zero-gpu                  1.3.24595.35+i538~22.04                 amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.8.0-25371                          amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.0.0-25398                          amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.0.0-25398                          all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.8.0-25329                          amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.0.0-25370                          amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.0.0-25370                          amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.0.0-25370                          all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.8.0-25334                          amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.8.0-25334                          all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      22.43.24595.35+i538~22.04               amd64        Intel graphics compute runtime for OpenCL

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
          zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
                zeModuleCreate = 1     \--->               zeModuleDestroy = 1
                zeKernelCreate = 1     \--->               zeKernelDestroy = 1
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
  zeCommandListCreateImmediate = 1     |
           zeCommandListCreate = 2     \--->          zeCommandListDestroy = 3
                 zeEventCreate = 8     \--->                zeEventDestroy = 8
                 zeFenceCreate = 2     \--->                zeFenceDestroy = 2
                 zeImageCreate = 0     \--->                zeImageDestroy = 0
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
              zeMemAllocDevice = 1     |
                zeMemAllocHost = 0     |
              zeMemAllocShared = 0     \--->                     zeMemFree = 0     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Aborted

crashlog.txt

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo
@zejun-chen

On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.

ii  intel-level-zero-gpu                  1.3.25593.18-601~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.9.0-43543                           amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.1.0-46342                           amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.1.0-46342                           all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.9.0-43482                           amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.1.0-46305                           amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.1.0-46305                           amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.1.0-46305                           all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.9.0-43484                           amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.9.0-43484                           all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      23.05.25593.18-601~22.04                 amd64        Intel graphics compute runtime for OpenCL
ii  libdrm-intel1:amd64                   2.4.115+git2303241447.28d9a3c4~j~mesarc0 amd64        Userspace interface to intel-specific kernel DRM services -- runtime

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())

Crash

ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1    
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2    
                zeModuleCreate = 2     \--->               zeModuleDestroy = 2    
                zeKernelCreate = 3     \--->               zeKernelDestroy = 3    
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1    
  zeCommandListCreateImmediate = 1     | 
           zeCommandListCreate = 5     \--->          zeCommandListDestroy = 6    
                 zeEventCreate = 18    \--->                zeEventDestroy = 18   
                 zeFenceCreate = 5     \--->                zeFenceDestroy = 5    
                 zeImageCreate = 0     \--->                zeImageDestroy = 0    
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0    
              zeMemAllocDevice = 2     | 
                zeMemAllocHost = 0     | 
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
 
 
Aborted (core dumped)

crash2.txt

@cchheennhhaaoo

I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo
@zejun-chen
I have the same problem on Ubuntu Linux too (not using Windows).

On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.

ii  intel-level-zero-gpu                  1.3.25593.18-601~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.9.0-43543                           amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.1.0-46342                           amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.1.0-46342                           all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.9.0-43482                           amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.1.0-46305                           amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.1.0-46305                           amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.1.0-46305                           all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.9.0-43484                           amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.9.0-43484                           all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      23.05.25593.18-601~22.04                 amd64        Intel graphics compute runtime for OpenCL
ii  libdrm-intel1:amd64                   2.4.115+git2303241447.28d9a3c4~j~mesarc0 amd64        Userspace interface to intel-specific kernel DRM services -- runtime

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())

Crash

ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1    
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2    
                zeModuleCreate = 2     \--->               zeModuleDestroy = 2    
                zeKernelCreate = 3     \--->               zeKernelDestroy = 3    
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1    
  zeCommandListCreateImmediate = 1     | 
           zeCommandListCreate = 5     \--->          zeCommandListDestroy = 6    
                 zeEventCreate = 18    \--->                zeEventDestroy = 18   
                 zeFenceCreate = 5     \--->                zeFenceDestroy = 5    
                 zeImageCreate = 0     \--->                zeImageDestroy = 0    
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0    
              zeMemAllocDevice = 2     | 
                zeMemAllocHost = 0     | 
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
 
 
Aborted (core dumped)

crash2.txt

@fredlarochelle

I am able to replicate the same issue on Fedora 37 with 6.2 and Ubuntu 22.04 with 5.19. Both instances involve a build from the latest xpu-master branch.

@BA8F0D39
Author

It is odd that the crash error is only reported when you enable the debug flags; otherwise the code crashes without any error message.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

@fredlarochelle

Here are some quick findings I had: it's not exactly at 4 GB, and I don't think the gibberish is related only to the 4 GB limit...

# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(40000, 40000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #3.2GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.287109375
# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46000, 46000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.232GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.2890625
# At 46001x46001 it goes gibberish
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46001, 46001, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.423218400GB
print("Mean:", torch.mean(array).item()) #0.00372314453125
print("Standard Deviation:", torch.std(array).item()) #0.049072265625

For FP16, I have some other weird bugs where sometimes it works and sometimes it doesn't, even for small arrays (less than 10000x10000). Even across multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10.

For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456 GB. Before that, starting around 30400x30400, the output alternates between gibberish and good results across successive runs.

With such numerical instability, I might write a script and test every possible combination at this point; it might also be worth taking a look at other random sampling methods.
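
A minimal sketch of such a sweep could look like this (the size range, step, dtype, and pass/fail tolerance are arbitrary choices; it assumes torch.rand output should have a mean near 0.5, and that torch.xpu.empty_cache() is available to release memory between runs).

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

for n in range(30000, 48001, 1000):
    x = torch.rand(n, n, dtype=torch.float32, device='xpu')
    size_gb = x.element_size() * x.nelement() / 1e9
    mean = torch.mean(x).item()
    verdict = "OK" if abs(mean - 0.5) < 0.01 else "BAD"
    print(f"{n}x{n} ({size_gb:.2f} GB): mean={mean:.6f} {verdict}")
    del x
    torch.xpu.empty_cache()  # assumed available; releases cached blocks between iterations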

@fredlarochelle

Just did another quick run for FP32 at 30800x30800 and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on...

Quick thought: since I am not using a fixed seed in those tests, might it be that some "bad seeds" are causing the instability?

@BA8F0D39
Author

BA8F0D39 commented May 25, 2023

@fredlarochelle
I think some pointers in the oneDNN GPU kernels use 32-bit unsigned integers and some use 64-bit unsigned integers. Reading more than 4 GB creates a buffer over-read (reading adjacent memory locations and other arrays).

If the adjacent memory locations happen to contain zeros, the mean is around 0.

If the adjacent memory locations happen to contain uniformly distributed values from 0 to 1, the mean is around 0.5.

It could allow you to read other programs' data on the GPU.
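
As a back-of-the-envelope illustration of that hypothesis: if a kernel computed byte offsets in an unsigned 32-bit integer, every offset past 4 GiB would wrap around and alias an address inside the first 4 GiB of the buffer (the snippet below is purely illustrative arithmetic, not IPEX or oneDNN code).

n = 47000 * 47000                    # elements in a 47000x47000 float32 tensor
total_bytes = n * 4                  # ~8.84 GB, more than twice the 4 GiB addressable with 32 bits
last_offset = total_bytes - 4        # byte offset of the last element
print("true offset:          ", last_offset)
print("offset seen as uint32: ", last_offset % 2**32)  # wraps back into the first 4 GiB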

@fredlarochelle

@BA8F0D39 That would make sense, but I still get the instability, and FP16 and FP32 start acting weird before they would actually overfill a 32-bit buffer. Between that and the instability, there is probably more than one problem going on at the same time.

@fengyuan14

@fredlarochelle @BA8F0D39 Thanks for the feedback.

The issue mentioned here (the so-called numerical instability) looks like one we met recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up.

BTW, as for the crashes when allocating memory larger than 4 GB, we cannot reproduce them on the recommended driver.

@BA8F0D39
Author

@arthuryuan1987
On Windows 11 with WSL, it crashes 100% of the time.

On Ubuntu Linux 22.04 with the 5.19 out-of-tree driver (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it crashes randomly and non-deterministically.
https://dgpu-docs.intel.com/driver/client/overview.html

On Ubuntu Linux 22.04 with the 6.3 mainline kernel, it also crashes randomly.

I can force it to crash 100% of the time by enabling the debug flags.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

@fredlarochelle

@arthuryuan1987 I am on Ubuntu 22.04.2 with 5.19.0.41-generic, on the latest driver, all following the installation instructions in the documentation, with a build from the latest commit on the xpu-master branch.

@BA8F0D39
Author

BA8F0D39 commented Jun 9, 2023

@arthuryuan1987

I used a Vulkan GPU memory tester.
https://github.com/GpuZelenograd/memtest_vulkan

It seems all memory regions above 4GB are corrupt and the read transfer speed is 1.9 GB/s.

./memtest_vulkan 1 9140000000 
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x190000000..=0x20FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6672 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Standard 5-minute test of 1: Bus=0x03:00 DevId=0x56A0   16GB Intel(R) Arc(tm) A770 Graphics (DG2)
      1 iteration. Passed  5.6310 seconds  written:    5.5GB 956.2GB/sec        checked:    8.2GB   1.5GB/sec
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:2
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 760|6653 42k 204k789k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

@fengyuan14

@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan
It appears to be an OpenCL-based application (tool). As far as I know, A64 stateless addressing has a big performance penalty on Arc, so I guess the OpenCL driver disables >4GB allocations. Regarding the IPEX stack, not all of the underlying components guarantee A64 stateless addressing, so after the next code synchronization IPEX will raise an explicit error to users as well.

@fredlarochelle

Could you please provide an update on the status of this issue? On the latest xpu-master branch, I have observed intermittent behavior: at times, allocating a batch larger than 4 GB crashes with the -5 error, while at other times it works correctly without any issues. Or might the -5 error I am getting be related to another issue? Interestingly, from my observations, the error does not seem to occur when the batch size stays under 4 GB.

@BA8F0D39
Author

BA8F0D39 commented Aug 23, 2023

@cchheennhhaaoo
I can still allocate more than 4 GB on Intel Arc with IPEX 2.0.110+xpu. However, feeding large images into resnet50 produces invalid results even though only 8 GB of the 16 GB on the Intel Arc A770 is used.

@Serizao

Serizao commented Sep 5, 2023

I have exactly the same bug: during torch fine-tuning, my script crashes with PI_ERROR_INVALID_MEM_OBJECT. Full stack:


---> piContextRelease(
        <unknown> : 0x5691520
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4d7d10
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4c3e20
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0x1053fe50
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0x1053f3e0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4e63a0
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4e73c0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4e56d0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4c3f50
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piDeviceRelease(
        <unknown> : 0xf474a70
) --->  pi_result : PI_SUCCESS

---> piDeviceRelease(
        <unknown> : 0xf475160
) --->  pi_result : PI_SUCCESS

---> piTearDown(
        <unknown> : 0
) --->  pi_result : PI_SUCCESS
        [out]void * : 0

---> piTearDown(
        <unknown> : 0
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2
                zeModuleCreate = 3     \--->               zeModuleDestroy = 3
                zeKernelCreate = 5     \--->               zeKernelDestroy = 5
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
  zeCommandListCreateImmediate = 1     |
           zeCommandListCreate = 3     \--->          zeCommandListDestroy = 4
                 zeEventCreate = 7     \--->                zeEventDestroy = 7
                 zeFenceCreate = 3     \--->                zeFenceDestroy = 3
                 zeImageCreate = 0     \--->                zeImageDestroy = 0
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
              zeMemAllocDevice = 2     |
                zeMemAllocHost = 0     |
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
) --->  pi_result : -38
        [out]void * : 0

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Abandon (core dumped)

I have an Arc A770 with 16 GB of memory. I am using the latest transformers version, which integrates XPU compute. Is a fix to use all available memory planned?

@cchheennhhaaoo

@cchheennhhaaoo I can still allocate more than 4 GB on Intel Arc with IPEX 2.0.110+xpu. However, feeding large images into resnet50 produces invalid results even though only 8 GB of the 16 GB on the Intel Arc A770 is used.

Please check this line in your repo.
https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/csrc/gpu/runtime/CachingDeviceAllocator.cpp#L190

For the invalid result issue, please refer to arthuryuan1987's comment above.

@tye1

tye1 commented Sep 19, 2023

@BA8F0D39 @fredlarochelle We don't plan to support this. You can still allocate >4 GB with 2.0.110+xpu because we disabled the allocation in master, not in the previously released drop. Could you please provide the justification for why >4 GB allocation is required?

@BA8F0D39
Author

@tye1

Using an image larger than 768x512 in Stable Diffusion 1.5 results in a blank or garbled image, even though PyTorch doesn't use all of the 16 GB in the A770.
comfyanonymous/ComfyUI#556

Every LLM is bigger than 4 GB, and they all fail to load on the A770 even though they fit in VRAM.

Other huge models and datasets bigger than 4 GB run out of memory.
#421

@fredlarochelle

fredlarochelle commented Sep 21, 2023

@tye1 Pretty much what @BA8F0D39 said, plus the fact that you need workarounds that you don't need with Nvidia GPUs, for example using a smaller batch size and loading multiple separate batches onto the GPU, ...

The main problem, I would say, is that a lot of PyTorch code you can find around the internet simply assumes that you can allocate more than 4 GB, since that is supported on Nvidia GPUs.

@BA8F0D39
Author

@tye1
All modern transformer/GAN models are larger than 4GB and they all fail with IPEX
#492

@ElliottDyson

@tye1 This is also an issue I have. For highly complex models and long sequence lengths, even a batch size of 1 can be larger than 4 GB. Such limits should be determined by the VRAM capacity of the GPU rather than in software, I would have thought.

@ghost

ghost commented Dec 26, 2023

Are there any updates on this or is the stance still "we don't plan to support this"? Asking, since if it's the latter, I'd be looking to sell my gpu sooner rather than later.

@ElliottDyson

ElliottDyson commented Dec 27, 2023

Are there any updates on this or is the stance still "we don't plan to support this"? Asking, since if it's the latter, I'd be looking to sell my gpu sooner rather than later.

Same here. Machine learning is the only reason I paid extra for a 16gb card.

@tye1

tye1 commented Jan 4, 2024

Sorry for the late response. We disabled >4 GB memory allocation on the Arc A770 because there are hardware limitations on Arc, and enabling it would bring a significant performance drop as the trade-off. This is not acceptable in IPEX's usage scenarios, hence we have disabled it.

@ElliottDyson

ElliottDyson commented Jan 4, 2024

Sorry for the late response. We disabled >4 GB memory allocation on the Arc A770 because there are hardware limitations on Arc, and enabling it would bring a significant performance drop as the trade-off. This is not acceptable in IPEX's usage scenarios, hence we have disabled it.

Again, thank you for the great work on this project,

But is there any possibility of keeping it enabled, with a warning that comes up when an allocation exceeds 4 GB and notes that performance will be significantly reduced? I imagine it's still better than CPU processing, which is the only alternative I (and I'm sure others too) have available.

Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4/16GB is really rather frustrating; I hope you can see where I'm coming from.

I also understand if there's absolutely nothing you can do; it would just be really rather disappointing. If that is the case, perhaps this is something the Arc/IPEX team could work on to make possible, if it isn't directly possible in this extension.

Thank you

@jgong5
Contributor

jgong5 commented Jan 4, 2024

Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4/16GB is really rather frustrating, I hope you can see where I'm coming from.

I guess this would only impact cases where you have to allocate single memory chunks larger than 4 GB. If your workload doesn't need such big chunks, you can still allocate a large amount of memory in total, up to 16 GB (maybe a little less than that due to the needs of the runtime/driver)?
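
As a quick illustration of that distinction (a sketch only; the sizes are arbitrary): several tensors that each stay below 4 GB can still add up to well over 4 GB of total device memory.

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

# Three ~3.6 GB tensors: ~10.8 GB in total on the device, but no single allocation exceeds 4 GB.
tensors = [torch.rand(30000, 30000, dtype=torch.float32, device='xpu') for _ in range(3)]
print(sum(t.element_size() * t.nelement() for t in tensors) / 1e9, "GB allocated in total")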

@ElliottDyson

ElliottDyson commented Jan 4, 2024

Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4/16GB is really rather frustrating, I hope you can see where I'm coming from.

I guess this would only impact the case where you have to allocate big memory chunks which are larger than 4GB. If your workload doesn't need such big chunk, you can still allocate large enough memory in total up to 16GB (maybe a little lower than that due to the need from runtime/driver)?

Unfortunately that's the issue: with long sequences and large model sizes when using transformer encoders, my use case requires being able to move more than 4 GB in one go. Unless there is a way built into this extension that automatically splits the model into chunks before loading it into memory (and the same for samples and/or batches)?

P.s. Or even a manual way to do this?

@go2tom42

Are there any updates on this or is the stance still "we don't plan to support this"? Asking, since if it's the latter, I'd be looking to sell my gpu sooner rather than later.

Same here. Machine learning is the only reason I paid extra for a 16gb card.

Same here. It's the ONLY reason I bought this card, the first Intel product I've bought in 15 years, and it will be the last.

@Nuullll

Nuullll commented Jan 16, 2024

I implemented a workaround in stable-diffusion-webui for scaled_dot_product_attention, which is memory intensive (so it easily triggers the 4 GB limitation on Arc), by slicing large-batch SDPA into smaller chunks: AUTOMATIC1111/stable-diffusion-webui#14353.

I'm wondering whether such a mechanism could be implemented at the IPEX framework level. Adding IPEX workarounds in upper-level applications is just not scalable.
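
A simplified sketch of the batch-slicing idea (this is not the actual webui patch; the chunk size is arbitrary, and the real workaround also estimates memory use to choose it):

import torch
import torch.nn.functional as F

def sdpa_chunked(q, k, v, chunk=8):
    # Slice the batch dimension so that each scaled_dot_product_attention call
    # stays under the ~4 GB allocation limit, then stitch the outputs back together.
    outs = [F.scaled_dot_product_attention(q[i:i + chunk], k[i:i + chunk], v[i:i + chunk])
            for i in range(0, q.shape[0], chunk)]
    return torch.cat(outs, dim=0)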

@ElliottDyson

I implemented a W/A in stable-diffusion-webui for scaled_dot_product_attention which is memory intensive (so easily triggers the 4GB limitation on Arc): AUTOMATIC1111/stable-diffusion-webui#14353, by slicing large-batch SDPA into smaller chunks.

I'm wondering whether such mechanism could be implemented at IPEX framework level. Adding IPEX W/A in upper level applications is just not scalable.

Whilst a neat idea, batches can easily be sent in smaller chunks via a plain software implementation in PyTorch, so I can't see much of a need for this at the core level; forgive me if I'm wrong. Something that can't be fixed in software, only at the firmware/library level, is when you're already running stochastic gradient descent (batch size of 1) and are still exceeding the 4 GB limit.

@dbenedb

dbenedb commented Jan 21, 2024

I'm OK with >4 GB allocations causing some slowdowns. But if this cannot be implemented at all, it means your GPUs are practically (not theoretically or technically) useless for Stable Diffusion, and I'm going to sell my A770.

@ghchris2021

I am still working my way through the Intel Arc documentation on how the global / process / surface / execution-unit / physical / virtual addressing works at the architecture level, and I don't fully understand how the multiple Intel driver / compute software layers above the hardware affect the memory limitations, but I'd like to better understand where these limitations sit in the HW / driver / compute SW stack.

It is disappointing for the A770 16GB to be a 16 GB VRAM GPU and yet not let a programmer easily access as much data as desired anywhere in the card's VRAM (and also in the host application's RAM, and ideally even beyond those limits, as I illustrate below).

It also makes me concerned for Battlemage, Celestial, and Druid, since the programmer's model of memory access on Nvidia GPUs has been (IMO) so much better, even on their consumer GPUs, for several past generations.

I gladly got the Arc A770 16GB to use its 16 GB of RAM for GPGPU, and I can see from several Intel documents that at least parts of the architecture support 64-bit addresses and 48-bit virtual addresses, so at first glance I don't see why such a limitation exists now. I certainly hope that as the Intel GPU line progresses (Battlemage, Celestial, Druid, ...) the VRAM size per card model will increase into the 24-64+ GB range, that Intel consumer motherboard platforms will evolve to support wider and 256-512 GB RAM, and that in such cases GPU / CPU virtual addressing will become seamless and extend to the system's virtual memory size, encompassing all system physical RAM and I/O if desired.

From Intel documentation showing mostly hopeful capabilities (though maybe SW is turning some things into SW limitations?):

Graphics Virtual Memory
...Although the range of supported graphics virtual addresses varies, most GPU commands and GPU
instructions use a common 64 bit definition for a graphics virtual address.

Per-Process GTT with 48b VA
The GPU typically operates on behalf of user-level processes (applications), each of which has it's own
"Per-Process" virtual address space. The size of this space is 256TB (48b address width).
...
Shared virtual global memory (SVM)
Accessible and visible to all work items on any GPU and the host.
...
NP STATE_BASE_ADDRESS Base addresses for the Instruction, General State, Surface State, and Bindless Surface State memory heaps. GPGPU kernels reference these memory areas using 32b offsets from the 64b base addresses.
...
Address Models
Data structures accessed by the data port are called "surfaces". There are four different Address Models used by the data port to access these surfaces:
...
64-bit Stateless model (A64).
A64 Flat/Stateless Model
This model is primarily intended for programmable shader programs.

Please see the below just for contrast, in terms of what I'd consider (as a developer) the most ideal programming model, and therefore the implied capabilities of the SW / HW architecture underneath it that makes it work so seamlessly.

In contrast to the Intel architecture above, in the case exemplified below (which already works on several generations of consumer Nvidia GPUs), the developer can seamlessly access data anywhere in the VRAM of any GPU in the system, as well as CPU memory anywhere in the application's address space, and even CPU virtual addresses that greatly exceed the physical VRAM or CPU-attached RAM of the system, all with efficient, high-performance access from either CPU application software or GPGPU kernels executing on any of the system's GPUs.

Here are the citations about the programmer's view of memory (as I understand it to be relevant) on competitive Nvidia GPUs (both consumer gaming cards and enterprise ones). The following citations / sources are about CUDA/Nvidia GPUs' "Unified Memory" and "Heterogeneous Memory" models, including on their consumer GPUs:

https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

Here are small relevant excerpts

The Benefits of Unified Memory on Pascal and Later GPUs

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.

Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern. In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.
Heterogeneous Memory Management (HMM) is a CUDA memory management feature that extends the simplicity and productivity of the [CUDA Unified Memory](https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) programming model to include system allocated memory on systems with PCIe-connected NVIDIA GPUs. System allocated memory refers to memory that is ultimately allocated by the operating system; for example, through malloc, mmap, the C++ new operator (which of course uses the preceding mechanisms), or related system routines that set up CPU-accessible memory for the application. 

Previously, on PCIe-based machines, system allocated memory was not directly accessible by the GPU. The GPU could only access memory that came from special allocators such as cudaMalloc or cudaMallocManaged. 

With HMM enabled, all application threads (GPU or CPU) can directly access all of the application’s system allocated memory. As with Unified Memory (which can be thought of as a subset of, or precursor to HMM), there is no need to manually copy system allocated memory between processors. This is because it is automatically placed on the CPU or GPU, based on processor usage.
...
Atomic memory operations and synchronization primitives

HMM supports all memory operations, which includes atomic memory operations. That is, programmers may use atomic memory operations to synchronize GPU and CPU threads with flags. 
...
Leverage memory-mapped I/O for fast development 

One of the interesting features that HMM provides is memory-mapped file I/O directly from the GPU. It enables developers to directly read files from supported storage or /disk without staging them in system memory and without copying the data to the high bandwidth GPU memory. 
...
The ERA5 dataset consists of hourly estimates of several atmospheric variables. In the dataset, total precipitation data for each month is stored in a separate file. We used 40 years of total precipitation data from 1981–2020, which sum to 480 input files aggregating to ~1.3 TB total input data size. See Figure 1 for example results.
...
Using the Unix mmap API, input files can be mapped to a contiguous virtual address space. With HMM, this virtual address can be passed as input to a CUDA kernel which can then directly access the values to build a histogram of total precipitation for each hour for all the days in a year. 
...
Enabling and detecting HMM
A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
...

I'm not sure why we can't have such a programming model, mapped to efficient hardware operations, for Alchemist, but I would have expected by now that GPUs with tens of GB of VRAM and CPUs with hundreds of GB of RAM and TB-scale virtual memory could simply access data in physical RAM/VRAM and virtual memory pretty flexibly, as exemplified above.

IMO it would be nice to see Battlemage, Celestial, Druid, and Alchemist improve this aspect of the programming model, and also finally implement SR-IOV so we can at least easily run graphics / compute in a few VMs (after all, consumer desktops already virtualize / MMU / IOMMU everything else and have 128+ GB RAM with 16+ core CPUs, etc.).

@ElliottDyson

Sorry for the late response. We disabled >4 GB memory allocation on the Arc A770 because there are hardware limitations on Arc, and enabling it would bring a significant performance drop as the trade-off. This is not acceptable in IPEX's usage scenarios, hence we have disabled it.

@tye1 This is not acceptable. Is a solution that allows >4 GB allocation possible?

The advertised VRAM on these cards is 16 GB. It's unacceptable to advertise that memory capacity, have it functional, and then later disable it, leaving the customer with a GPU they can no longer use.

I was very excited to build a server with 7x A770s for machine learning work. In my previous work with the A770 I had no issues, which led me to invest in $2500 worth of GPUs. Please think about how frustrating this is from the customer's perspective.

The thing is, if all they're concerned about is slowdowns, then wouldn't it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks greater than 4 GB in size?

Significant slowdowns would mean at least it still works. Some functionality is better than no functionality. I'm sure a lot of people would agree with that. @tye1

@ghchris2021

The thing is, if all they're concerned about is slowdowns, then wouldn't it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks that are greater than 4GB in size.

Significant slowdowns would mean at least it still works. Some functionality is better than no functionality. I'm sure a lot of people would agree with that. @tye1

Exactly. I mean these 4GB limits (IIRC) have been variously mentioned here (wrt. pytorch programmers), for the OpenCL implementation (OCL programmers), etc.

Ok, so it (the limitation) is something that directly affects GPU/HPC/ML programmers.

As a group that writes HPC / GPGPU code, I think we're especially used to benchmarking / analyzing / optimizing our code wrt. a myriad of trade-offs as to capability vs. speed vs. complexity etc.

"Oh look I'm going beyond {L1, L2, L3} cache size / cache line / page size -- significant performance drop"
ok, expected, but often necessary / desired if one needs the added RAM size.

Same thing using RAM vs registers or accessing RAM non sequentially, or about 50 other cases where real world code must / should deviate from the ideal best case performance strategy and must have the flexibility to do it as the programmer decides best at design time or even run-time.

I'd rather the most flexible / capable possibility "just work" easily, and if I have to optimize things somehow (if even possible) then I'll spend the time to optimize the DSA I used or choose new speed / capability trade-offs if that's even appropriate.

I'm hoping our RAM / VRAM sizes will keep increasing substantially every generation (16G ARCs now, hopefully 32-48G "ECCd" B / C / D / NV / AMD / whatever cards in months / a year or so to come) so it seems key to be able to actually use ("it just works" style) all that VRAM one has paid for (particularly since IIRC as aforementioned "it just works" on the CPU execution device vs the GPU device having the unusual case limit).

@ProjectPhysX

ProjectPhysX commented Feb 24, 2024

Single >4GB VRAM allocations are possible on Arc, but currently they require 2 small workarounds in the application. For OpenCL, these are:

  1. In every cl::Buffer/clCreateBuffer allocation, you have to set the buffer flag bit (1<<23), which in the driver is called CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL.
  2. In cl::Program::build/clBuildProgram, you have to set the compiler option "-cl-intel-greater-than-4GB-buffer-required".

I've added this to my OpenCL-Wrapper in this commit, so anything built on top of it works on Arc out-of-the-box.

For Level Zero, the workarounds are similar: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

I agree that >4GB allocations should be enabled by default. Such a limitation is out of place at a time when AI and simulation models commonly need much larger VRAM capacities. Using the full 16GB VRAM capacity of a 16GB GPU has to work no matter what. ISVs should not have to manually add patches only for Arc to enable basic functionality. Better to eliminate this complication and just make it work, and provide the option to disable >4GB allocations for optimization.
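
For anyone experimenting from Python, a rough pyopencl sketch of the two OpenCL workarounds above could look like this (pyopencl is my assumption here, since the original workaround was added to a C++ OpenCL wrapper; the flag bit and the build option are the ones quoted above):

import pyopencl as cl

CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = 1 << 23  # buffer flag bit from point 1 above

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# 1. Allocate a single buffer larger than 4 GB with the Intel-specific flag bit set.
nbytes = 6 * 1024**3  # 6 GiB
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL, size=nbytes)

src = """
__kernel void fill(__global float* data) {
    data[get_global_id(0)] = 0.5f;
}
"""
# 2. Build the kernel with the Intel-specific compiler option.
prog = cl.Program(ctx, src).build(options=["-cl-intel-greater-than-4GB-buffer-required"])
prog.fill(queue, (nbytes // 4,), None, buf)
queue.finish()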

@uniartisan

I am very curious whether this problem has been solved in the latest stable version of IPEX. The only purpose of my buying an A770 was to develop LLMs. My code can even run correctly on a Moore Threads S80 but cannot run with Intel's IPEX. At some point I have to allocate large amounts of memory, especially when doing batch inference.

@ElliottDyson

I am very curious whether this problem has been solved in the latest stable version of ipex. The only purpose of buying a770 is to develop llm. My code can even run correctly on Moore thread s80 but cannot run on Intel's ipex. At some point I have to allocate large amounts of memory, especially when doing batch inference.

I'd very much be interested in hearing if this is fixed yet.

However, you can still do something for now if it isn't:
In the scenario of batches, because they're batches, you should in theory be able to send the batch in parts smaller than 4 GB and then have them processed as a whole batch, I'd have thought. It's not ideal because of the added latency of splitting, sending the parts one after another, and recombining, but at least the latency will be less than inferencing a single sequence at a time.

You've also got the benefit of still being able to do things. I have a scenario that becomes impossible due to this limitation: a model or an individual sequence that's greater than 4 GB, something that can't be split into parts and recombined on the GPU after being sent in pieces smaller than 4 GB... 😞

@uniartisan

uniartisan commented May 17, 2024

I am very curious whether this problem has been solved in the latest stable version of ipex. The only purpose of buying a770 is to develop llm. My code can even run correctly on Moore thread s80 but cannot run on Intel's ipex. At some point I have to allocate large amounts of memory, especially when doing batch inference.

I'd very much be interested in hearing if this is fixed yet.

However you can still do something for now if it isn't: In the scenario of batches, because they're batches, you should in theory have the ability to send the batch in parts smaller than 4GB and then have them be processed as a whole batch I'd have thought. Whilst not ideal due to the added latency of splitting, sending one after another, then recombining. And at least the latency will be less than it would be inferencing a single sequence at a time.

You've also got the benefit of still being able to do things, I have a scenario that becomes impossible due to this limitation. Which is having a model or individual sequence that's greater than 4GB, something that can't be split into parts then recombined on the GPU once it's been sent in parts smaller than 4GB... 😞

Can I ask whether you are using WSL, or Ubuntu 22.04 directly, or another system? When I run my previous code in WSL (it has been verified on CUDA, CPU, and even a Moore Threads S80 GPU), problems often occur, and I have already submitted an issue for one of them. I would like to know what system you are using, to see whether I need to switch systems. :)

@ElliottDyson

I am very curious whether this problem has been solved in the latest stable version of ipex. The only purpose of buying a770 is to develop llm. My code can even run correctly on Moore thread s80 but cannot run on Intel's ipex. At some point I have to allocate large amounts of memory, especially when doing batch inference.

I'd very much be interested in hearing if this is fixed yet.

However you can still do something for now if it isn't: In the scenario of batches, because they're batches, you should in theory have the ability to send the batch in parts smaller than 4GB and then have them be processed as a whole batch I'd have thought. Whilst not ideal due to the added latency of splitting, sending one after another, then recombining. And at least the latency will be less than it would be inferencing a single sequence at a time.

You've also got the benefit of still being able to do things, I have a scenario that becomes impossible due to this limitation. Which is having a model or individual sequence that's greater than 4GB, something that can't be split into parts then recombined on the GPU once it's been sent in parts smaller than 4GB... 😞

Can I ask whether you are using WSL or directly using Ubuntu 22.04 or other systems? Because when I run my previous code in WSL (they have been verified on CUDA, CPU, and even Moorethread s80 GPU), problems often occur, and I have already submitted an issue for one of them. I would like to know what system you are using, to see if I need to replace the system. :)

I run directly on Windows nowadays, but I've previously tried it directly on Ubuntu and in WSL. It behaves the same.
