
'out of resources' error returned from OpenCL code on NVIDIA cards with memory >8 GB #30

Closed
mwibral opened this issue Feb 8, 2019 · 9 comments


@mwibral
Collaborator

mwibral commented Feb 8, 2019

There is an issue with NVIDIA cards with more (!) than 8 GB of memory ironically reporting an 'out of resources' error some time into the computation (e.g. when running systemtest_lorenz2_opencl.py). Cards of the same chip architecture with up to 8 GB do not seem to have that problem, e.g.:
Cards running fine: Titan 1st gen. (Kepler, 6 GB), GTX 1080 (Pascal, 8 GB)
Cards returning errors: Quadro P6000 (24 GB), Tesla V100 (32 GB)

@pwollstadt
Owner

Could not reproduce this on a GeForce GTX TITAN X with 12 GB main memory (running Ubuntu 14.04.5 LTS). Maybe collecting devices/setups that do and don't work helps to narrow down the list of possible causes?

@orlandi

orlandi commented Mar 5, 2019

Maybe collecting devices/setups that do and don't work helps to narrow down the list of possible causes?

In case it helps, I was able to run systemtest_lorenz2_opencl.py successfully on GeForce RTX 2080 Ti (Turing, 11 GB) and Tesla K20c (Kepler, 5 GB) cards.

@orlandi

orlandi commented Mar 7, 2019

I spoke too soon. Although the Lorenz code did run, I'm experiencing the same issue when using my own data: an OUT_OF_RESOURCES error on the 2080 Ti (11 GB), but no problems on the Tesla K20c (5 GB) or when using CPUs.
I'm running the standard multivariate TE with the OpenCLKraskovCMI estimator:

from idtxl.multivariate_te import MultivariateTE
from idtxl import idtxl_io as io
import pickle

# Load the MATLAB v7.3 array; dim_order 'rps' = replications x processes x samples.
data = io.import_matarray(
        file_name='test.mat',
        array_name='XR',
        dim_order='rps',
        file_version='v7.3',
        normalise=False)

network_analysis = MultivariateTE()
settings = {'cmi_estimator': 'OpenCLKraskovCMI',
            'max_lag_sources': 3,
            'min_lag_sources': 1}

results = network_analysis.analyse_network(settings=settings, data=data)

pickle.dump(results, open('results.p', 'wb'))

The data structure contains 16 processes, 46 samples, and 1106 replications.
With 200 replications it runs fine, but with the above number it results in the following error when computing sources for the first target:


---------------------------- (2) include source candidates
candidate set: [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3), (5, 1), (5, 2), (5, 3), (6, 1), (6, 2), (6, 3), (7, 1), (7, 2), (7, 3), (8, 1), (8, 2), (8, 3), (9, 1), (9, 2), (9, 3), (10, 1), (10, 2), (10, 3), (11, 1), (11, 2), (11, 3), (12, 1), (12, 2), (12, 3), (13, 1), (13, 2), (13, 3), (14, 1), (14, 2), (14, 3), (15, 1), (15, 2), (15, 3)]
testing candidate: (14, 1) maximum statistic, n_perm: 200
Traceback (most recent call last):
  File "multivariateTEtestR.py", line 29, in <module>
    results = network_analysis.analyse_network(settings=settings, data=data)
  File "/home/benuccilab/IDTxl/idtxl/multivariate_te.py", line 159, in analyse_network
    settings, data, targets[t], sources[t])
  File "/home/benuccilab/IDTxl/idtxl/multivariate_te.py", line 276, in analyse_single_target
    self._include_source_candidates(data)
  File "/home/benuccilab/IDTxl/idtxl/network_inference.py", line 826, in _include_source_candidates
    self._include_candidates(candidates, data)
  File "/home/benuccilab/IDTxl/idtxl/network_inference.py", line 120, in _include_candidates
    conditional=self._selected_vars_realisations)
  File "/home/benuccilab/IDTxl/idtxl/estimator.py", line 278, in estimate_parallel
    return self.estimate(n_chunks=n_chunks, **data)
  File "/home/benuccilab/IDTxl/idtxl/estimators_opencl.py", line 539, in estimate
    n_chunks_current_run)
  File "/home/benuccilab/IDTxl/idtxl/estimators_opencl.py", line 680, in _estimate_single_run
    cl.enqueue_copy(self.queue, distances, d_distances)
  File "/home/benuccilab/conda/envs/idtxl/lib/python3.7/site-packages/pyopencl/__init__.py", line 1709, in enqueue_copy
    return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
pyopencl._cl.RuntimeError: clEnqueueReadBuffer failed: OUT_OF_RESOURCES

I did check memory usage on the card, and it was always very low, less than 1 GB.
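
For reference, the failing shape can be reproduced without test.mat by building an IDTxl Data object from synthetic values; a minimal sketch (the random array is a stand-in for the original data):

import numpy as np
from idtxl.data import Data

# Synthetic stand-in with the failing shape:
# 16 processes x 46 samples x 1106 replications ('psr' ordering).
data = Data(np.random.randn(16, 46, 1106), dim_order='psr', normalise=False)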

@mwibral
Collaborator Author

mwibral commented Mar 7, 2019 via email

@mwibral
Collaborator Author

mwibral commented Apr 3, 2019

Some more info, now that I am testing on multiple machines, including AMD ones.

(1) On two machines with a Vega 64 and AMD ROCm's OpenCL, I get from Python/pyopencl:

Memory access fault by GPU node-1 (Agent handle: 0x564110c33270) on address 0xa02a00000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Note that the address 0xa02a00000 is identical on both systems, although the cards are slightly different (a regular Vega 64, 8 GB, and a WX9100, 16 GB, Radeon Pro model).

dmesg returns:

gmc_v9_0_process_interrupt: 6 callbacks suppressed
[Wed Apr 3 18:05:00 2019] amdgpu 0000:19:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process pid 0 thread pid 0)
[Wed Apr 3 18:05:00 2019] amdgpu 0000:19:00.0: in page starting at address 0x0000000a02a00000 from 27
[Wed Apr 3 18:05:00 2019] amdgpu 0000:19:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00801030

(2) On an AMD APU with the amdgpu-pro driver I simply get a system crash. This happens if I run more than 37 or 38 replications in systemtest_lorenz2_opencl.py.

(3) Using the -older- develop version that Aaron is running in Frankfurt (obtained from Patricia via email, I think?), I get a different error, and much earlier in the process:
pyopencl._cl.RuntimeError: clCreateSubBuffer failed: MISALIGNED_SUB_BUFFER_OFFSET

Googling for this last error message turns up posts (https://stackoverflow.com/questions/17575032/using-clcreatesubbuffer) indicating that memory management should be done in relation to the device property CL_DEVICE_MEM_BASE_ADDR_ALIGN (maybe we can use the padding we had constructed for the AMD S10000 in a more flexible way to satisfy these requirements?).
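
As a quick check, the required alignment can be queried per device with pyopencl; a minimal sketch:

import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        # CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; sub-buffer
        # offsets must be a multiple of this value converted to bytes.
        align_bytes = device.get_info(cl.device_info.MEM_BASE_ADDR_ALIGN) // 8
        print(device.name, 'needs sub-buffer offsets aligned to', align_bytes, 'bytes')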

@georgedimitriadis

Hi,

Just to chip in with the same error.

I am running Windows 10 on an Intel i7-5960X with 2 GeForce Titan X (12 GB) cards.
I am on a miniconda / Python 3.6 system and everything seems to be working fine (OpenCL drivers for everything, all dependencies up and running, etc.).

I have a data set with 839 processes, 14422 samples, 1 replication.

I run the following code:

network_analysis = MultivariateTE()
settings_gpu = {'cmi_estimator': 'OpenCLKraskovCMI',
                'gpuid': 1,
                'max_lag_sources': 8,
                'min_lag_sources': 1,
                'max_lag_target': 4}
results = network_analysis.analyse_network(settings=settings_gpu, data=data)

and I get the error:

Traceback (most recent call last):
  File "<stdin>", line 19, in <module>
  File "e:\software\develop\source\repos\idtxl\idtxl\multivariate_te.py", line 159, in analyse_network
    settings, data, targets[t], sources[t])
  File "e:\software\develop\source\repos\idtxl\idtxl\multivariate_te.py", line 276, in analyse_single_target
    self._include_source_candidates(data)
  File "e:\software\develop\source\repos\idtxl\idtxl\network_inference.py", line 826, in _include_source_candidates
    self._include_candidates(candidates, data)
  File "e:\software\develop\source\repos\idtxl\idtxl\network_inference.py", line 120, in _include_candidates
    conditional=self._selected_vars_realisations)
  File "e:\software\develop\source\repos\idtxl\idtxl\estimator.py", line 278, in estimate_parallel
    return self.estimate(n_chunks=n_chunks, **data)
  File "e:\software\develop\source\repos\idtxl\idtxl\estimators_opencl.py", line 539, in estimate
    n_chunks_current_run)
  File "e:\software\develop\source\repos\idtxl\idtxl\estimators_opencl.py", line 680, in _estimate_single_run
    cl.enqueue_copy(self.queue, distances, d_distances)
  File "E:\Software\Develop\Languages\Pythons\Miniconda35\lib\site-packages\pyopencl\__init__.py", line 1712, in enqueue_copy
    return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
pyopencl._cl.RuntimeError: clEnqueueReadBuffer failed: OUT_OF_RESOURCES

Any help with this?

Thanks

pwollstadt pushed a commit that referenced this issue Aug 23, 2019
Replace unsigned int types in OpenCL/CUDA code. For very large point
sets this leads to an overflow and incorrect indexing of arrays.
Add test scripts.

Update CUDA makefile.

Fixes #30.
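
For illustration, the overflow the commit message describes can be reproduced with 32-bit index arithmetic in numpy (the sizes and the row * signallength indexing scheme are assumptions for the example, chosen to match the magnitudes discussed in this thread):

import numpy as np

signallength = 7341056   # e.g. a padded pointset of ~7.3 million points
row = 600                # hypothetical row into a (rows x signallength) buffer
idx32 = np.uint32(row) * np.uint32(signallength)  # wraps modulo 2**32 (numpy may warn)
idx64 = np.uint64(row) * np.uint64(signallength)  # correct 64-bit index
print(int(idx32), int(idx64))  # 109666304 vs. 4404633600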
@mwibral
Collaborator Author

mwibral commented Dec 12, 2019

An update on this issue (after the fix of the int index that's already included in the branch fix_gpu_bug):

Unfortunately, there are still errors if the product n_points * dim * n_chunks exceeds a certain threshold AND the padding is used (where necessary), i.e. if the number of data points that go to the GPU card is not a multiple of 1024. In that case the computation on the GPU runs (as seen by the time elapsed until the error), but there is a memory access violation when returning, leading to the following error messages:
(AMD) Memory access fault by GPU node-1 (Agent handle: 0x562731f06a00) on address 0xa06200000. Reason: Page not present or supervisor privilege.
(NVIDIA) clEnqueueReadBuffer failed: OUT_OF_RESOURCES

This does not happen when the data that goes to the GPU is a multiple of 1024, i.e. when we pad with zero points, or when we switch off padding altogether (the latter only works on NVIDIA cards, see below).
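
A minimal sketch of such zero-padding (a hypothetical helper, not the IDTxl implementation):

import numpy as np

def pad_pointset(pointset, multiple=1024):
    # Zero-pad along the points axis so the number of points becomes a
    # multiple of `multiple`; returns the padded set and the pad count.
    dim, n_points = pointset.shape
    n_padded = -(-n_points // multiple) * multiple  # round up to next multiple
    padded = np.zeros((dim, n_padded), dtype=pointset.dtype)
    padded[:, :n_points] = pointset
    return padded, n_padded - n_points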

Note that the padding is only necessary on cards that need manual sub-buffer alignment (AMD cards). So on NVIDIA cards a simple solution would be to detect the manufacturer and switch off the padding altogether.

On some AMD cards that only provide OpenCL 1.2 capabilities (e.g. the Lexa XT chip and the old Hawaii chips) there seems to be no problem with the padding, for reasons unknown. So for AMD cards that provide only OpenCL 1.2 capabilities the solution could be to detect the capabilities and use the padding as is; a sketch of both checks follows below.
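
Both checks are straightforward with pyopencl; a sketch of the suggested detection logic (untested, with device selection simplified):

import pyopencl as cl

device = cl.get_platforms()[0].get_devices()[0]    # simplified device selection
is_nvidia = 'NVIDIA' in device.vendor              # e.g. 'NVIDIA Corporation'
is_cl12 = device.version.startswith('OpenCL 1.2')  # e.g. 'OpenCL 1.2 CUDA'
use_padding = not is_nvidia  # suggested policy: no padding on NVIDIA cards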

Btw. running the OpenCL code on a multicore CPU using Intel's OpenCL implementation also works (it's just 100x slower), so I guess there are no really gross errors in the implementation of the actual OpenCL kernel.

The remaining problems on AMD cards with the ROCm driver and OpenCL 2.0 capabilities (definitely Vega, possibly Polaris and Fiji) need to be solved in the OpenCL code. It is also possible that there is an OpenCL 2.0 issue, possibly in pyopencl.

I would be very glad if someone else could confirm the above observations by:

  1. cloning the latest repo
  2. switching to the branch fix_gpu_bug
  3. cd to ..../IDTxl/dev/search_GPU/deliverable2_1
  4. run $> time python test_opencl_search.py --gpuid 0 -p 3670018 -d 2 -c 2 --padding
    (step 4 will take somewhere between 5 and 50 minutes, depending on your GPU; if it crashes you'll get an error message; the code will almost certainly not hang, just give it some time :-))

and then report:
(A) whether there was a crash or not, i.e. the output of the above command
(B) OS, GPU type, GPU driver, GPU RAM size, and the output of clinfo (which includes the OpenCL 1.2 vs. OpenCL 2.0 capabilities)
(C) results for other point sizes, dimensions, and chunk numbers

@Markwelt

Not sure if this is still open or under consideration, but anyway...
On Windows with an NVIDIA GPU with 8 GB of memory and OpenCL 1.2, the test in deliverable2_1 fails, though demo_multivariate_te.py (with 'cmi_estimator': 'OpenCLKraskovCMI') works on the branch fix_gpu_bug (and not on the updated main master of IDTxl). More info below.

OS: Win 10 (Enterprise ver 1909 build 18363.778); GPU: NVIDIA GeForce RTX 2070; pyOpenCL: pyopencl-2020.2+cl12-cp38-cp38-win_amd64.whl (max 1.2 on NVIDIA as you know)
python test_opencl_search.py --gpuid 0 -p 3670018 -d 2 -c 2 --padding gives:

gpuid: 0
Applying padding to multiple of 1024

pointset: 56.01, TOTAL: 112.02 MB, PADDING: 1020
pointset shape: (2, 7341056)
pointset shape % n_chunks: 0 (chunkkength: 3670018)

Selected Device: GeForce RTX 2070
DEBUG:pyopencl.cache:build program: binary cache hit (key: 8b09c644ab05a53347cd9f1e19d108a3)
DEBUG:pytools.persistent_dict:pyopencl-invoker-cache-v7: disk cache hit [key=9830e4e464ac850c53a15d29fb1ea0822392841646effb7ab7351bc85139fdf0]
INFO:pyopencl:build program: kernel 'kernelKNNshared' was part of a lengthy cache retrieval (0.50 s)
DEBUG:pytools.persistent_dict:pyopencl-invoker-cache-v7: disk cache hit [key=b0be9d43833a6a67bdabcaf743eafe38855ffce6660ec5502b12c36615acb806]
DEBUG:pytools.persistent_dict:pyopencl-invoker-cache-v7: disk cache hit [key=b4efd189bde0f677b380b90ad80abcdf9858b798940c0e9f9b0bfc8b6beb4077]
INFO:pyopencl:build program: kernel 'kernelBFRSAllshared' was part of a lengthy cache retrieval (0.50 s)
DEBUG:pytools.persistent_dict:pyopencl-invoker-cache-v7: disk cache hit [key=fb92d9fdefe95aeac01927058216ade6f72b3d21cc609bafdbf44c1b956191a5]
Pointset: 14682112 elements, dim 2x7341056, 2 chunks (chunklength: 3670018).
workitems_x: 256, NDrange_x: 7341056
device distances size: 58728448
host distances size: 14682112
clEnqueueReadBuffer failed: OUT_OF_RESOURCES
Execution time: 6.07 min

!!! GPU execution failed

There is no clinfo on Windows, so below is a shortened version of a GPU Caps Viewer report with more GPU and OpenCL info.

===================================================
GPU Caps Viewer v1.45.1.0 report
===================================[ System / CPU ]

  • CPU Name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

  • CPU Core Speed: 3695 MHz

  • CPU logical cores: 12

  • Family: 6 - Model: 14 - Stepping: 10

  • Physical Memory Size: 65536 MB

  • Operating System: Windows 10 64-bit build 18363

  • PhysX Version: 9190218
    ===================================[ Graphics Adapters / GPUs ]

  • Current Display Mode: 1920x1080 @ 60 Hz - 32 bpp

  • Num GPUs: 1

  • GPU 1

    • Name: NVIDIA GeForce RTX 2070
    • GPU codename: TU106-400
    • Device ID: 10DE-1F02
    • Subdevice ID: 3842-2070
    • Revision ID: A1
    • GPU brand type: GeForce
    • Driver: 27.21.14.5122 (R451.22)
    • Shader cores: 2304
    • Texture units: 144
    • ROP units: 64
    • TDP: 175W
    • BIOS version: 90.06.18.40.0c
    • Memory size: 8191MB
    • Memory type: GDDR6
    • Memory bus width: 256-bit
    • GPU base clock: 1410 MHz
      ===================================[ OpenGL GPU Capabilities ]
  • GL_VENDOR: NVIDIA Corporation

  • GL_RENDERER: GeForce RTX 2070/PCIe/SSE2

  • GL_VERSION: 4.6.0 NVIDIA 451.22

  • GL_SHADING_LANGUAGE_VERSION: 4.60 NVIDIA
    ===================================[ NVIDIA CUDA Capabilities ]

  • CUDA Device 0

    • Device name: GeForce RTX 2070
    • PCI bus ID: 1
    • Compute Capability: 7.5
    • Total memory: 4095 MB
    • Peak memory bandwidth: 448 GB/s
    • L2 cache: 4 MB
    • Core clock rate: 1620 MHz
    • Memory clock rate: 7001 MHz
    • Multiprocessors (SMs): 36
    • CUDA cores per SM: 64
    • CUDA cores: 2304
    • Async engines: 3
    • Warp Size: 32
    • Max Threads Per Block: 1024
    • Threads Per Block: 1024 x 1024 x 64
    • Grid Size: 2147483647 x 65535 x 65535
    • Registers Per Block: 65536
    • Texture Alignment: 512 byte
    • Total Constant Memory: 64 Kb
      ===================================[ OpenCL Capabilities ]
  • Num OpenCL platforms: 1

  • CL_PLATFORM_NAME: NVIDIA CUDA

  • CL_PLATFORM_VENDOR: NVIDIA Corporation

  • CL_PLATFORM_VERSION: OpenCL 1.2 CUDA 11.0.186

  • CL_PLATFORM_PROFILE: FULL_PROFILE

  • Num devices: 1

    • CL_DEVICE_NAME: GeForce RTX 2070
    • CL_DEVICE_VENDOR: NVIDIA Corporation
    • CL_DRIVER_VERSION: 451.22
    • CL_DEVICE_PROFILE: FULL_PROFILE
    • CL_DEVICE_VERSION: OpenCL 1.2 CUDA
    • CL_DEVICE_TYPE: GPU
    • CL_DEVICE_VENDOR_ID: 0x10DE
    • CL_DEVICE_MAX_COMPUTE_UNITS: 36
    • CL_DEVICE_MAX_CLOCK_FREQUENCY: 1620MHz
    • CL_NV_DEVICE_COMPUTE_CAPABILITY_MAJOR: 7
    • CL_NV_DEVICE_COMPUTE_CAPABILITY_MINOR: 5
    • CL_NV_DEVICE_REGISTERS_PER_BLOCK: 65536
    • CL_NV_DEVICE_WARP_SIZE: 32
    • CL_NV_DEVICE_GPU_OVERLAP: 1
    • CL_NV_DEVICE_KERNEL_EXEC_TIMEOUT: 1
    • CL_NV_DEVICE_INTEGRATED_MEMORY: 0
    • CL_DEVICE_ADDRESS_BITS: 32
    • CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2097152KB
    • CL_DEVICE_GLOBAL_MEM_SIZE: 8192MB
    • CL_DEVICE_MAX_PARAMETER_SIZE: 4352
    • CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 128 Bytes
    • CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 1152KB
    • CL_DEVICE_ERROR_CORRECTION_SUPPORT: NO
    • CL_DEVICE_LOCAL_MEM_TYPE: Local (scratchpad)
    • CL_DEVICE_LOCAL_MEM_SIZE: 48KB
    • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64KB
    • CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
    • CL_DEVICE_MAX_WORK_ITEM_SIZES: [1024 ; 1024 ; 64]
    • CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
    • CL_EXEC_NATIVE_KERNEL: 6333188
    • CL_DEVICE_IMAGE_SUPPORT: YES
    • CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
    • CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 32
    • CL_DEVICE_IMAGE2D_MAX_WIDTH: 32768
    • CL_DEVICE_IMAGE2D_MAX_HEIGHT: 32768
    • CL_DEVICE_IMAGE3D_MAX_WIDTH: 16384
    • CL_DEVICE_IMAGE3D_MAX_HEIGHT: 16384
    • CL_DEVICE_IMAGE3D_MAX_DEPTH: 16384
    • CL_DEVICE_MAX_SAMPLERS: 32
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 1
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 1
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
    • CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
    • CL_DEVICE_EXTENSIONS: 17
    • Extensions:
      • cl_khr_global_int32_base_atomics
      • cl_khr_global_int32_extended_atomics
      • cl_khr_local_int32_base_atomics
      • cl_khr_local_int32_extended_atomics
      • cl_khr_fp64
      • cl_khr_byte_addressable_store
      • cl_khr_icd
      • cl_khr_gl_sharing
      • cl_nv_compiler_options
      • cl_nv_device_attribute_query
      • cl_nv_pragma_unroll
      • cl_nv_d3d10_sharing
      • cl_khr_d3d10_sharing
      • cl_nv_d3d11_sharing
      • cl_nv_copy_opts
      • cl_nv_create_buffer
      • cl_khr_int64_base_atomics

pwollstadt pushed a commit that referenced this issue Nov 16, 2020
Replace unsigned int types in OpenCL/CUDA code. For very large point
sets this leads to an overflow and incorrect indexing of arrays.
Add test scripts.

Update CUDA makefile.

Fixes #30.
pwollstadt pushed a commit that referenced this issue Nov 17, 2020
Replace unsigned int types in OpenCL/CUDA code. For very large point
sets this leads to an overflow and incorrect indexing of arrays.
Add test scripts.

Update CUDA makefile.

Fixes #30.
@mwibral
Collaborator Author

mwibral commented Dec 9, 2020

I have uploaded a preliminary bugfix for this problem. See branch OpenCL_bugfix. Testing is appreciated.

@mwibral mwibral closed this as completed Dec 9, 2020
pwollstadt pushed a commit that referenced this issue Feb 25, 2021
…riables: signallength_padded and signallength_orig; I set the padding default to true and made callers aware of the additional argument in the OpenCL kernels

Fixes #30.