Deadlocks when accessing Context with active GL interop #235

Open
s-ol opened this issue May 26, 2018 · 10 comments

@s-ol
Contributor

s-ol commented May 26, 2018

I'm having trouble porting my array-based code to an interop-based renderer.

I'm instantiating the array using an allocator:

		def gl_buffer_allocator(size):
			ubo = glGenBuffers(1)
			glBindBuffer(GL_UNIFORM_BUFFER, ubo)
			glBufferStorage(GL_UNIFORM_BUFFER, size, None, GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)
			glBindBuffer(GL_UNIFORM_BUFFER, 0)
			return GLBuffer(ctx, mem_flags.READ_WRITE, int(ubo))

to_device cannot work because it doesn't acquire the GLBuffer, and I cannot do that beforehand since the buffer isn't allocated yet.
It works like this:

  		self.grid = Array(queue, self.grid_array.shape, self.grid_array.dtype, allocator=allocator)
  		self.grid.queue = None # didn't want to associate a queue yet
  		self.grid.allocator = None # make sure `.get()` doesn't allocate GLBuffers

For some reason, passing a Context instead of a CommandQueue makes this lock up here. Freeing the context seems like the wrong thing to do...?

#0  0x00007fffeb5749aa in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#1  0x00007fffeb1b2190 in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#2  0x00007fffeb4c9f32 in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#3  0x00007fffec70b6a3 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#4  0x00007fffe8853544 in ?? () from /usr/lib/libnvidia-opencl.so.1
#5  0x00007fffe8752881 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe8751595 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007ffff032eb94 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#8  0x00007ffff0572d95 in context::~context() () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so
#9  0x00007ffff057303a in context::~context() () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so
#10 0x00007ffff055f9fd in ?? () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so

The same deadlock is preventing me from using grid.with_queue(), grid.setitem() etc.
I realized later that I can trigger it just by accessing the context attribute of a CommandQueue:

	def step(self):
		with CommandQueue(self.ctx) as queue:
			cl.enqueue_acquire_gl_objects(queue, [self.grid.base_data])
			# uncomment to lock
			# queue.context
			self.grid.set(self.grid_array, queue=queue)
			cl.enqueue_acquire_gl_objects(queue, [self.grid.base_data])
		print('got here')

Interestingly, it runs until 'got here', but I never see the result of the set call. The step() method also never returns for me.
If I debug the script in pudb, the interface closes as I step out of the method.

I'll see if I can create a small reproducible example now.

@s-ol
Contributor Author

s-ol commented May 26, 2018

Here we go:

from OpenGL.GL import *
from OpenGL.GLUT import *
import pyopencl as cl
import pyopencl.array
import numpy as np

def get_ctx():
	from pyopencl.tools import get_gl_sharing_context_properties
	import sys

	platform = cl.get_platforms()[0]

	if sys.platform == "darwin":
		return cl.Context(properties=get_gl_sharing_context_properties(),
				devices=[])
	else:
		# Some OSs prefer clCreateContextFromType, some prefer
		# clCreateContext. Try both.
		try:
			return cl.Context(properties=[
				(cl.context_properties.PLATFORM, platform)]
				+ get_gl_sharing_context_properties())
		except:
			return cl.Context(properties=[
				(cl.context_properties.PLATFORM, platform)]
				+ get_gl_sharing_context_properties(),
				devices = [platform.get_devices()[0]])

def gl_allocator(size):
	ubo = glGenBuffers(1)
	glBindBuffer(GL_UNIFORM_BUFFER, ubo)
	glBufferStorage(GL_UNIFORM_BUFFER, size, None, GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)
	glBindBuffer(GL_UNIFORM_BUFFER, 0)
	return cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(ubo))

glutInit()
glutInitWindowSize(512, 512)
glutCreateWindow('gpWFC')
glutDisplayFunc(lambda: 0)

ctx = get_ctx()
data = np.arange(100)
with cl.CommandQueue(ctx) as queue:
	arr = cl.array.Array(queue, data.shape, data.dtype, allocator=gl_allocator)

def key(*args):
	print("key pressed")
	with cl.CommandQueue(ctx) as queue:
		cl.enqueue_acquire_gl_objects(queue, [arr.base_data])
		queue.context
		arr.set(data, queue=queue)
		cl.enqueue_release_gl_objects(queue, [arr.base_data])
glutKeyboardFunc(key)
glutMainLoop()

Leave this open, press any key once, and it should lock up.
My system info is in this comment.

@inducer
Owner

inducer commented May 30, 2018

> It works like this:

I'd discourage in-place modification of Array instances. Instead, simply pass your buffer to the constructor via the data= kwarg.
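
A minimal sketch of what passing the buffer via data= could look like. The helper name `make_interop_array` and its parameters are hypothetical. It assumes pyopencl is installed, that `ctx` was created with GL sharing properties, and that `gl_buffer_id` names an already-allocated GL buffer that is large enough:

```python
def make_interop_array(ctx, queue, gl_buffer_id, shape, dtype):
    """Wrap an existing GL buffer in a pyopencl Array via the data= kwarg.

    Hypothetical helper: assumes a GL-sharing context and a GL buffer
    that already has storage for `shape` elements of `dtype`.
    """
    # Imported lazily so this sketch can be defined without pyopencl present.
    import pyopencl as cl
    import pyopencl.array  # noqa: F401  (makes cl.array available)

    buf = cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(gl_buffer_id))
    # With data= supplied, the Array does not call an allocator for its
    # own storage, so no attributes need to be patched afterwards.
    return cl.array.Array(queue, shape, dtype, data=buf)
```

The buffer would still need to be acquired with cl.enqueue_acquire_gl_objects before it is read or written from OpenCL.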

> Freeing the context seems like the wrong thing to do...?

That's weird. OpenCL is reference counted, so all clReleaseContext should do is decrease the refcount--unless that was indeed the last reference to the context.
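
To make the reference-counting point concrete, here is a toy pure-Python model. `FakeContext`, `retain`, and `release` are hypothetical stand-ins for the clRetainContext/clReleaseContext behavior described above, not PyOpenCL or driver code:

```python
class FakeContext:
    """Toy model of an OpenCL-style reference-counted handle (hypothetical)."""

    def __init__(self):
        self.refcount = 1      # the creator holds the first reference
        self.destroyed = False

    def retain(self):
        # Models clRetainContext: one more owner.
        self.refcount += 1

    def release(self):
        # Models clReleaseContext: only the last release tears things down.
        if self.destroyed:
            raise RuntimeError("release() called on a destroyed context")
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True


ctx = FakeContext()
ctx.retain()                      # a second owner, e.g. a wrapper object
ctx.release()                     # refcount 2 -> 1: nothing is torn down
still_alive = not ctx.destroyed
ctx.release()                     # refcount 1 -> 0: actual teardown
```

Under this model, a release that is not the last one should be cheap, which is why a hang inside clReleaseContext is surprising.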

@s-ol
Contributor Author

s-ol commented May 31, 2018

@inducer I tried that, but it also triggered the hang. Maybe clReleaseContext is only decreasing the refcount and something else is going on. I based the title on the backtrace alone.

If you don't have time to look into this, could you recommend a debugging strategy?

EDIT: leaving this link here for reference, I'll check my dmesg output next time and also see if I can get a test setup on Windows.

@s-ol
Contributor Author

s-ol commented Oct 21, 2018

@inducer have you had a chance to take a look at the minimal example I provided above?

s-ol changed the title from "Deadlocks because Context is being freed (Arrays with GL interop)" to "Deadlocks when accessing Context with active GL interop" on Oct 21, 2018
@inducer
Owner

inducer commented Oct 21, 2018

I have not, sorry. But you may want to retry with git master, which is a whole different code base (actually, mostly a revival of the old Boost.Python code on top of pybind11).

@s-ol
Contributor Author

s-ol commented Oct 29, 2018

> I have not, sorry. But you may want to retry with git master, which is a whole different code base (actually, mostly a revival of the old Boost.Python code on top of pybind11).

Great, I've given it a shot but I am experiencing some issues with NVIDIA Optimus / Bumblebee on my laptop: Bumblebee-Project/Bumblebee#778

Xlib:  extension "NV-GLX" missing on display ":0"

Having dealt with these things in the past, though, I think the fix is just to wait until I get back to my desktop PC, where Optimus doesn't get in the way.

@s-ol
Contributor Author

s-ol commented Jan 16, 2019

Unfortunately still experiencing the same problem:

(gdb) bt
#0  0x00007f98d5c76853 in  () at /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007f98b0999478 in  ()
#2  0x00007ffd5f107d58 in  ()
#3  0x00007ffd5f107d58 in  ()
#4  0x000055e2a22f3870 in  ()
#5  0x00007f98d6cf0ebd in  () at /usr/lib/libGLX_nvidia.so.0
#6  0x00007f98d5c3901d in  () at /usr/lib/libnvidia-glcore.so.415.25
#7  0x00007f98d5bf0d02 in  () at /usr/lib/libnvidia-glcore.so.415.25
#8  0x00007f98d6cb8033 in glcuR0d4nX () at /usr/lib/libGLX_nvidia.so.0
#9  0x00007f98d2e1d794 in  () at /usr/lib/libnvidia-opencl.so.1
#10 0x00007f98d2d1b7d1 in  () at /usr/lib/libnvidia-opencl.so.1
#11 0x00007f98d2d1a4e5 in  () at /usr/lib/libnvidia-opencl.so.1
#12 0x00007f98d92deef4 in clReleaseContext () at /usr/lib/libOpenCL.so.1
#13 0x00007f98d8cdad8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#14 0x00007f98d8cda6a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#15 0x00007f98d8cce01f in pybind11_object_dealloc ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#16 0x00007f98e4e5dd9e in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#17 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#18 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#19 0x00007f98e4e57c42 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#20 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#21 0x00007f98e4d9fdec in _PyFunction_FastCallDict () at /usr/lib/libpython3.7m.so.1.0
#22 0x00007f98e4e5943c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#23 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#24 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#25 0x00007f98e4e58b7d in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#26 0x00007f98e4d9fc0b in _PyFunction_FastCallDict () at /usr/lib/libpython3.7m.so.1.0
#27 0x00007f98e4e5943c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#28 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#29 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
--Type <RET> for more, q to quit, c to continue without paging--

This is my own code, but interestingly enough I now have the same problem running examples/gl_interop_demo.py:

(gdb) bt
#0  0x00007fffe8b11896 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007fffe8b3e5fc in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#2  0x00007fffe87657b0 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#3  0x00007fffe8a8bd02 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#4  0x00007fffe9b53033 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#5  0x00007fffe5cd8794 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe5bd67d1 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007fffe5bd54e5 in ?? () from /usr/lib/libnvidia-opencl.so.1
#8  0x00007ffff7203ef4 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#9  0x00007ffff5411d8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff54116a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#11 0x00007ffff540501f in pybind11_object_dealloc ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#12 0x00007ffff7b664c0 in _PyFunction_FastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#13 0x00007ffff7bd8dfa in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#14 0x00007ffff7b1fb99 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#15 0x00007ffff7b20ab4 in PyEval_EvalCodeEx () from /usr/lib/libpython3.7m.so.1.0
#16 0x00007ffff7b20adc in PyEval_EvalCode () from /usr/lib/libpython3.7m.so.1.0
#17 0x00007ffff7c4ac94 in ?? () from /usr/lib/libpython3.7m.so.1.0
#18 0x00007ffff7c4c8be in PyRun_FileExFlags () from /usr/lib/libpython3.7m.so.1.0
#19 0x00007ffff7c4dc75 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.7m.so.1.0
#20 0x00007ffff7c4feb7 in ?? () from /usr/lib/libpython3.7m.so.1.0
#21 0x00007ffff7c500fc in _Py_UnixMain () from /usr/lib/libpython3.7m.so.1.0
#22 0x00007ffff7dae223 in __libc_start_main () from /usr/lib/libc.so.6
#23 0x000055555555505e in _start ()

However, examples/gl_particle_animation.py works fine...

@inducer
Owner

inducer commented Jan 17, 2019

What are the differences in the context setup code between examples/gl_interop_demo.py and examples/gl_particle_animation.py? What happens if you graft the context setup code from one onto the other?

@s-ol
Contributor Author

s-ol commented Jan 19, 2019

In examples/gl_particle_animation.py the context is created simply by

platform = cl.get_platforms()[0]
ctx = cl.Context(properties=[(cl.context_properties.PLATFORM, platform)] + get_gl_sharing_context_properties())  

while in examples/gl_interop_demo.py there is this somewhat more elaborate block:

platform = cl.get_platforms()[0]

from pyopencl.tools import get_gl_sharing_context_properties
import sys
if sys.platform == "darwin":
    ctx = cl.Context(properties=get_gl_sharing_context_properties(),
            devices=[])
else:
    # Some OSs prefer clCreateContextFromType, some prefer
    # clCreateContext. Try both.
    try:
        ctx = cl.Context(properties=[
            (cl.context_properties.PLATFORM, platform)]
            + get_gl_sharing_context_properties())
    except:
        ctx = cl.Context(properties=[
            (cl.context_properties.PLATFORM, platform)]
            + get_gl_sharing_context_properties(),
            devices = [platform.get_devices()[0]])

Replacing the second with the first doesn't change the outcome, though:

(gdb) bt
#0  0x00007fffe8b3e5f2 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007fffe87657b0 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#2  0x00007fffe8a8bd02 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#3  0x00007fffe9b53033 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#4  0x00007fffe5cd8794 in ?? () from /usr/lib/libnvidia-opencl.so.1
#5  0x00007fffe5bd67d1 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe5bd54e5 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007ffff7203ef4 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#8  0x00007ffff5411d8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#9  0x00007ffff54116a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff540501f in pybind11_object_dealloc ()

Also, I finally managed to load the Python GDB utils, but they don't give any more information (I assume because my Python build is not compiled for debugging):

(gdb) thread apply all py-bt-full

Thread 11 (Thread 0x7fffe1554700 (LWP 31451)):
Unable to locate python frame

Thread 10 (Thread 0x7fffe1d55700 (LWP 31450)):
Unable to locate python frame

Thread 9 (Thread 0x7fffe2556700 (LWP 31449)):
Unable to locate python frame

Thread 8 (Thread 0x7fffe2d57700 (LWP 31448)):
Unable to locate python frame

Thread 7 (Thread 0x7fffe3558700 (LWP 31447)):
Unable to locate python frame

Thread 6 (Thread 0x7fffe3f61700 (LWP 31446)):
#0 Waiting for the GIL

Thread 5 (Thread 0x7fffe4762700 (LWP 31445)):
Unable to locate python frame

Thread 4 (Thread 0x7fffedb81700 (LWP 31436)):
Unable to locate python frame

Thread 3 (Thread 0x7ffff2382700 (LWP 31435)):
Unable to locate python frame

Thread 2 (Thread 0x7ffff2b83700 (LWP 31434)):
Unable to locate python frame

Thread 1 (Thread 0x7ffff7883600 (LWP 31416)):
#12 (unable to read python frame information)
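
One alternative that may be worth trying when py-bt-full finds no frames is the standard library's faulthandler module, which can dump Python-level tracebacks from a watchdog thread after a timeout. The sketch below is only an illustration: the 0.2 s timeout and the `time.sleep` standing in for the hanging call are assumptions, and it has not been tried against this particular hang.

```python
import faulthandler
import tempfile
import time

# faulthandler writes to the file descriptor directly, so use a real file.
log = tempfile.TemporaryFile(mode="w+")

# If the program is still running after 0.2 s, dump every thread's traceback.
faulthandler.dump_traceback_later(0.2, file=log)

time.sleep(0.5)  # stand-in for the call that never returns

faulthandler.cancel_dump_traceback_later()
log.seek(0)
report = log.read()
```

In a real run, the file would be sys.stderr (the default) and the timeout would be a few seconds longer than the slowest legitimate step.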

@inducer
Owner

inducer commented Jan 20, 2019

So the fact that the backtrace contains clReleaseContext suggests that the Nvidia runtime has a bug that makes it mishandle decreasing the context refcount (perhaps when this happens while GL interop is still active). Something to try would be to hold on to a handle to the context somewhere, so that it doesn't get released prematurely.
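
A pure-Python sketch of the "hold on to a handle" idea. `Handle` and `_KEEP_ALIVE` are hypothetical names; `Handle.__del__` stands in for a wrapper whose destructor would call clReleaseContext:

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for a wrapper whose destructor releases a context."""

    released = []  # names of handles whose destructor has run

    def __init__(self, name):
        self.name = name

    def __del__(self):
        Handle.released.append(self.name)


def temporary_wrapper():
    # Nothing keeps this object, so its destructor runs right away.
    Handle("temporary")


_KEEP_ALIVE = []  # module-level list that outlives any single callback


def pinned_wrapper():
    h = Handle("pinned")
    _KEEP_ALIVE.append(h)  # extra reference: the destructor is deferred
    return weakref.ref(h)


temporary_wrapper()
ref = pinned_wrapper()
gc.collect()
```

In the reported setup, the analogous step would be storing the context object (for example, the result of queue.context) on something long-lived. Whether that actually avoids the driver hang is untested here.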
