Threaded OpenGL API calls #1452

fzurita · 2017-04-02T16:36:36Z

This is the ongoing work to get a threaded implementation of the OpenGL API going.

This is not threaded yet; so far it only contains the refactoring needed to support threading.

Currently this does seem to have a performance regression. I still have to figure out where it's coming from.

fzurita · 2017-04-02T22:06:30Z

The performance penalty comes from the unbuffered drawer, probably because I can't cache the vertices any more.

I think I know how I will fix this. I will use the old version of the unbuffered drawer in single-threaded mode. In threaded mode I will use the current version, or the buffered drawer if the hardware allows.

Edit:

Ok, done. I made a "thread safe" version of the UnbufferedDrawer. The plugin runs about 25% slower using that version. This will be used in threaded mode whenever the hardware doesn't support the Buffer Storage extension. I can't think of a clean way to fix that for devices without the extension; hopefully the threading makes up for it on those devices.

Edit2:
I was able to come up with an optimization that is thread safe and gives me 95% of the performance of the non-thread-safe way of doing it.
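
For illustration only, here is a minimal sketch of why a deferred draw command cannot simply keep the caller's vertex pointer: once the GL call runs later on another thread, the caller may already have overwritten that memory. The class name `DeferredAttribCommand` and the `submit` callback (standing in for the real GL call) are hypothetical and are not taken from this PR; the sketch only assumes that the thread-safe drawer has to own a copy of the data.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical command: owns a private copy of the vertex bytes so the
// producer thread can reuse its own buffer right after enqueueing.
class DeferredAttribCommand {
public:
	DeferredAttribCommand(const void* data, std::size_t size)
		: m_copy(std::make_shared<std::vector<char>>(size))
	{
		if (size != 0)
			std::memcpy(m_copy->data(), data, size);
	}

	// `submit` stands in for the real GL call made on the render thread.
	void execute(const std::function<void(const void*, std::size_t)>& submit) const
	{
		submit(m_copy->data(), m_copy->size());
	}

private:
	std::shared_ptr<std::vector<char>> m_copy;
};
```

The extra copy per draw is the price of thread safety in this sketch, which is one plausible reason a naive thread-safe drawer would be slower than one that caches pointers.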

fzurita · 2017-04-04T06:43:22Z

Ok, I enabled threaded mode and of course it didn't work right off the bat.

I'm getting INVALID_OPERATION on glMapBufferRangeCommand. This doesn't happen when not in threaded mode with the same arguments. Here are the arguments:

target=8892 offset=0 length=4194304 access=c2

Anyone have any ideas?
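
For anyone reading along, the arguments can be decoded as follows, assuming target and access are printed in hexadecimal and length in decimal: target 0x8892 is GL_ARRAY_BUFFER, length 4194304 is 4 MiB, and access 0xc2 combines the write, persistent and coherent bits. The persistent and coherent bits come from the buffer storage functionality. The helper below is a hypothetical sketch (its name is not from the codebase); the bit values are the ones defined in the OpenGL headers.

```cpp
#include <cstdint>
#include <string>

// Bit values as defined in the OpenGL headers.
struct MapFlag {
	std::uint32_t bit;
	const char* name;
};

const MapFlag kMapFlags[] = {
	{0x0001, "GL_MAP_READ_BIT"},
	{0x0002, "GL_MAP_WRITE_BIT"},
	{0x0004, "GL_MAP_INVALIDATE_RANGE_BIT"},
	{0x0008, "GL_MAP_INVALIDATE_BUFFER_BIT"},
	{0x0010, "GL_MAP_FLUSH_EXPLICIT_BIT"},
	{0x0020, "GL_MAP_UNSYNCHRONIZED_BIT"},
	{0x0040, "GL_MAP_PERSISTENT_BIT"},
	{0x0080, "GL_MAP_COHERENT_BIT"},
};

// Hypothetical helper: turns an access bitfield into a readable string.
std::string describeMapAccess(std::uint32_t access)
{
	std::string out;
	for (const MapFlag& flag : kMapFlags) {
		if ((access & flag.bit) != 0) {
			if (!out.empty())
				out += " | ";
			out += flag.name;
		}
	}
	return out.empty() ? std::string("0") : out;
}
```

Calling `describeMapAccess(0xc2)` yields `GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT`.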

gonetz · 2017-04-04T07:46:49Z

https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMapBufferRange.xhtml

There are many cases for INVALID_OPERATION. I'd check "if zero is bound to target" and "is already in a mapped state" first.

fzurita · 2017-04-04T12:14:37Z

I found at least one problem. There were still instances of the old unwrapped GL calls left in the code (GET_GL_FUNCTION, to be more precise). Now I just need to figure out how to fix those.

fzurita · 2017-04-04T12:59:40Z

Ok, good news, it works! For some reason the FPS display is not working so I can't judge performance. And for some reason the thread is not exiting when core exits.

Also, I expect performance to be identical at first since there are many operations that we are doing that are stalling the pipeline. So until I fix those, I don't expect any improvements.

Edit: I think a good solution for now would be to block the main thread whenever there is more than one swap buffers call in the queue, until the video plugin catches up. I think that will work pretty well.

fzurita · 2017-04-05T04:39:13Z

Ok, after performing several optimizations on the threading, I'm seeing about a 25% performance improvement in FPS in CBFD. I'm thinking that the percentage improvement in FPS is going to correspond to how much time the emulation spends in the core.

AmbientMalice · 2017-04-05T04:43:15Z

> I'm seeing about a 25% performance improvement in FPS in CBFD.

Is this on desktop or mobile, and a 25% improvement on what baseline? (Unthreaded, or unoptimized threaded?)

fzurita · 2017-04-05T04:44:17Z

This is on mobile, unthreaded vs. threaded. Threaded has 25% better performance in that game.

loganmc10 · 2017-04-05T04:54:32Z

Let me know when you feel it's ready for testing (and how to turn it on and off). I suspect it might be driver-dependent. I think Nvidia has threaded optimizations built into the driver, so this might just add extra overhead on top of that, but testing will reveal that.

AmbientMalice · 2017-04-05T04:56:03Z

Isn't AMD notoriously poor at threading? This might help reduce CPU overheads on AMD.

fzurita · 2017-04-05T04:58:01Z

Right, I don't think it's ready yet. There are a few bugs I need to work out.

For example, in Mario 64, the core runs so fast when fast forwarding that the OpenGL command queue keeps growing indefinitely. I need to figure out a way to drop non-essential commands and skip frames.

fzurita · 2017-04-05T05:05:03Z

This will only help improve performance by allowing the GPU to keep doing work while the video plugin is not executing. So if the emulator spends 80% of the time in the video plugin, we can get an additional 20% performance.

Also, this will not help at all if we have to do readbacks from video memory to CPU memory in sync mode, for example if we need to read the color buffer synchronously. If we read the color buffer in async mode, though, this will help.

loganmc10 · 2017-04-05T13:26:46Z

> For example, in Mario 64, the core runs so fast when fast forwarding that the OpenGL command queue keeps growing indefinitely. I need to figure out a way to drop non-essential commands and skip frames.

Isn't this essentially the problem with running the OpenGL calls in their own thread? Won't this de-sync the video from the rest of the emulator? I'm still having a hard time understanding how this would improve performance without causing other issues with the emulation.

loganmc10 · 2017-04-05T14:09:01Z

src/Graphics/OpenGLContext/opengl_ContextImpl.cpp

@@ -60,7 +62,7 @@ void ContextImpl::init()
 	}

 	{
-		if ((m_glInfo.isGLESX && (m_glInfo.bufferStorage && m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32)) || !m_glInfo.isGLESX)
+		if ((m_glInfo.isGLESX && m_glInfo.bufferStorage) || !m_glInfo.isGLESX)


Why was m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32 removed?

BufferedDrawer depends on glDrawElementsBaseVertex, which is only available in GLES 3.2+

fzurita · 2017-04-05

Yeah, good point. I should put that back.
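
A side note on the version encoding used in that condition: `majorVersion * 10 + minorVersion` maps GLES 3.2 to exactly 32, so a strict `> 32` comparison as quoted would exclude 3.2 itself. The sketch below uses `>= 32`, on the assumption that the intent is "GLES 3.2 or newer" as stated in the review comment. `GlInfoSketch` and the function names are hypothetical stand-ins, not the project's types.

```cpp
// Hypothetical stand-in for the fields used in the condition above.
struct GlInfoSketch {
	bool isGLESX;
	bool bufferStorage;
	int majorVersion;
	int minorVersion;
};

// Same encoding as the diff: 3.2 becomes 32, 3.1 becomes 31.
int encodeVersion(int major, int minor)
{
	return major * 10 + minor;
}

// Inclusive "GLES 3.2 or newer" test (assumed intent of the check).
bool canUseBufferedDrawer(const GlInfoSketch& info)
{
	if (!info.isGLESX)
		return true; // non-GLES path, as in the original condition
	return info.bufferStorage
		&& encodeVersion(info.majorVersion, info.minorVersion) >= 32;
}
```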

loganmc10 · 2017-04-05T14:10:27Z

src/Graphics/OpenGLContext/opengl_UnbufferedDrawer.cpp

-}
+	FunctionWrapper::glLineWidth(_width);
+	FunctionWrapper::glDrawArrays(GL_LINES, 0, 2);
+}


Needs a newline here

loganmc10 · 2017-04-05T14:10:39Z

src/Graphics/OpenGLContext/opengl_UnbufferedDrawer.h

@@ -26,4 +26,4 @@ namespace opengl {
 		std::array<const void*, MaxAttribIndex> m_attribsData;
 	};

-}
+}


Needs a newline here

loganmc10 · 2017-04-05T14:11:07Z

src/Graphics/OpenGLContext/opengl_WrappedFunctions.cpp

+namespace opengl {
+
+	std::array<std::shared_ptr<std::vector<char>>, MaxAttribIndex> GlVertexAttribPointerUnbufferedCommand::m_attribsData;
+}


Needs a newline here

loganmc10 · 2017-04-05T14:11:52Z

src/Graphics/OpenGLContext/opengl_WrappedFunctions.h

+			::CoreVideo_GL_SwapBuffers();
+		}
+	};
+}


Needs a newline here

loganmc10 · 2017-04-05T14:12:40Z

src/Graphics/OpenGLContext/opengl_Wrapper.h

+	{
+		executeCommand(std::make_shared<GlTextureSubImage2DUnbufferedCommand<pixelType>>(texture, level, xoffset, yoffset, width, height, format, type, std::move(pixels)));
+	}
+}


fzurita · 2017-04-05T15:23:04Z

By the way, this should allow us to implement frame skipping based on command queue size. If there is more than one "swap buffers" command in the queue, we should be able to drop some commands to catch up. I just need to figure out what commands are safe to drop.

fzurita · 2017-04-06T06:02:52Z

Ok, I'm making a lot of headway. I fixed numerous bugs. Also, before swapping buffers, we will check if there is already a queued swap buffer ahead in the queue. If there is, we will wait until that is executed before continuing.

This prevents the situation where the GL command queue gets too far behind the emulation core. In the long run though, I think we want to skip frames.
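
The swap-buffer wait described above could look roughly like the sketch below. This is not the PR's implementation; `ThrottledCommandQueue` and its methods are hypothetical names. The assumption is a single producer (the emulation core) and a single consumer (the render thread), with the producer blocking in `pushSwap()` while an earlier swap is still pending.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical queue: pushSwap() blocks while an earlier swap is still
// waiting to be executed, so the producer stays at most one frame ahead.
class ThrottledCommandQueue {
public:
	using Command = std::function<void()>;

	void push(Command cmd)
	{
		{
			std::lock_guard<std::mutex> guard(m_mutex);
			m_queue.push_back({std::move(cmd), false});
		}
		m_notEmpty.notify_one();
	}

	void pushSwap(Command cmd)
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		// Wait until the previously queued swap has been executed.
		m_swapDone.wait(lock, [this] { return m_pendingSwaps == 0; });
		m_queue.push_back({std::move(cmd), true});
		++m_pendingSwaps;
		lock.unlock();
		m_notEmpty.notify_one();
	}

	// Called by the render thread: runs one command, blocking if empty.
	void popAndRun()
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
		Entry entry = std::move(m_queue.front());
		m_queue.pop_front();
		lock.unlock();

		entry.cmd();

		if (entry.isSwap) {
			{
				std::lock_guard<std::mutex> guard(m_mutex);
				--m_pendingSwaps;
			}
			m_swapDone.notify_all();
		}
	}

	int pendingSwaps()
	{
		std::lock_guard<std::mutex> guard(m_mutex);
		return m_pendingSwaps;
	}

private:
	struct Entry {
		Command cmd;
		bool isSwap;
	};

	std::mutex m_mutex;
	std::condition_variable m_notEmpty;
	std::condition_variable m_swapDone;
	std::deque<Entry> m_queue;
	int m_pendingSwaps = 0;
};
```

Blocking bounds the queue length but does not skip work, which is why frame skipping would still need a separate decision about which commands are safe to drop.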

Here are a few more things I need to do to call this complete:

- Add Project64 support (I may need some help with this since I'm not familiar with the API)
- Make this setting configurable
- Figure out why single threaded performance is poor in some situations. I can always fall back to some of the old code for this. (I may have been measuring wrong, I can't find a performance reduction now)
- Add async glReadPixels support. Right now all glReadPixels calls will wait for the command queue to empty, eliminating the benefit of threading.
- If BufferedDrawer is not supported, don't use UnbufferedDrawer while using threads; use UnbufferedDrawerThreadSafe instead
- Take care of any comments or additional issues that show up.
- Optimize UnbufferedDrawerThreadSafe. It's way too slow, which causes devices that don't support the buffered drawer to actually run slower in threaded mode.

I may add to this as I see problems.

fzurita · 2017-04-06T13:07:06Z

@loganmc10 Can you give this a test? You can enable this by setting "ThreadedVideo" to true in mupen64plus.

Edit: Also, keep in mind that this only works with mupen64plus and I have yet to update the CMake file.

dankcushions · 2017-12-24T13:39:18Z

Here goes! Now, for whatever reason I've never been able to get VerticalSync = False to work on my Raspberry Pi. It always seems to be synced (never goes above ~60), so it's difficult to tell performance gains, but I can usually check regressions. Current master results come first, ThreadedVideo = True results second.

Mario 64

| Scene | VIS (master) | VIS (threaded) | FPS (master) | FPS (threaded) | Verdict |
| --- | --- | --- | --- | --- | --- |
| during Mario's face intro | 59 | 57-61 | 29 | 25-30 | about the same |
| idling outside castle | 57-61 | 58-61 | 28-31 | 28-31 | about the same |
| inside front door | 57-61 | 58-61 | 28-31 | 27-31 | about the same |
| idling at start of Bob-omb Battlefield | 42-45 | 44-53 | 22-25 | 23-25 | slightly better? |

Goldeneye 64
(curiously in both, all wall and floor textures were unfiltered)

| Scene | VIS (master) | VIS (threaded) | FPS (master) | FPS (threaded) | Verdict |
| --- | --- | --- | --- | --- | --- |
| during Bond's gun barrel walk | 29 | 28 | 28 | 28 | about the same |
| dam intro pan | 60 | 60 | 25 | 27 | about the same |
| idling at start of dam | 44-51 | 25-63 | 22-27 | 22-31 | slightly better? |

I'd say there are no regressions and it seems like a slight improvement, but I feel the benchmarking I am able to do is not definitive :( I wonder if @psyke83 or @gizmo98 can think of a better way?

fzurita · 2017-12-24T14:07:13Z

I'm glad to hear that at least it's not a performance penalty. I think implementing the object pool helped with that.

I think this code really helps when Async color to RDRAM is enabled, which the RPI doesn't have.

Configuration can only be set using mupen64plus for now
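
As a rough illustration of the object pool idea mentioned above, here is a minimal sketch in which objects are obtained through a `get` method and returned to the pool when the last reference goes away, instead of being allocated and freed per command. `ObjectPoolSketch` is a hypothetical name and this is not the PR's implementation; it only assumes that per-command allocations are what the pool is meant to avoid.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool: get() hands out a recycled object when one is free,
// otherwise it allocates a new one.
template <typename T>
class ObjectPoolSketch {
public:
	std::shared_ptr<T> get()
	{
		std::unique_ptr<T> obj;
		{
			std::lock_guard<std::mutex> guard(m_state->mutex);
			if (!m_state->free.empty()) {
				obj = std::move(m_state->free.back());
				m_state->free.pop_back();
			}
		}
		if (!obj)
			obj.reset(new T());

		// The deleter returns the object to the pool instead of freeing it.
		// Capturing the shared state keeps it alive past the pool itself.
		std::shared_ptr<State> state = m_state;
		return std::shared_ptr<T>(obj.release(), [state](T* p) {
			std::lock_guard<std::mutex> guard(state->mutex);
			state->free.emplace_back(p);
		});
	}

	std::size_t freeCount() const
	{
		std::lock_guard<std::mutex> guard(m_state->mutex);
		return m_state->free.size();
	}

private:
	struct State {
		std::mutex mutex;
		std::vector<std::unique_ptr<T>> free;
	};

	std::shared_ptr<State> m_state = std::make_shared<State>();
};
```

A recycled object keeps whatever state it had when it was released, so in this sketch the caller is expected to set all fields again after each `get()`.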

fzurita · 2019-01-26T19:58:30Z

Rebased against latest master.

Tested a little bit and the performance boost seems to still be there, especially with Async Color buffer to RDRAM enabled. Around a 10% performance improvement on my device, but it will depend on how fast color buffer copies are.

With WGL, function creation must be done in the same thread as the context.

The incorrect assumption was made that the plugin would start clean every time.

So that threading mode can be changed on the fly

GL commands must now be created using a "get" method

gonetz · 2019-01-28T11:39:47Z

Merged to fzurita-threaded_GLideN64 branch.

fzurita force-pushed the threaded_GLideN64 branch 4 times, most recently from 942cf85 to 9eca4bf Compare April 4, 2017 04:14

fzurita force-pushed the threaded_GLideN64 branch 3 times, most recently from 1d51917 to c55b2e2 Compare April 5, 2017 06:41

loganmc10 reviewed Apr 5, 2017


fzurita force-pushed the threaded_GLideN64 branch from c55b2e2 to a0b8720 Compare April 6, 2017 05:12

fzurita mentioned this pull request Mar 28, 2018

Running OpenGL API calls on a separate thread? #1441

Closed

gonetz force-pushed the master branch from dbc9a06 to 087ac76 Compare November 14, 2018 11:08

fzurita added 3 commits January 25, 2019 16:54

Added simple blocking queue implementation

58a44a7

Implement openGL wrapper that allows for threading

a889979

Add configuration to turn threaded OpenGL on or off

cc71d33

Configuration can only be set using mupen64plus for now

fzurita force-pushed the threaded_GLideN64 branch from 4986397 to bc518a1 Compare January 26, 2019 19:57

fzurita force-pushed the threaded_GLideN64 branch 2 times, most recently from d28eb0c to c59c65b Compare January 26, 2019 21:17

fzurita added 15 commits January 26, 2019 16:20

Refactor code to use new OpenGL wrapper

181c756

Fix GL function init failure in windows

e6e2331

With WGL, function creation must be done in the same thread as the context.

Enable GPU thread limiter for Zilmar

84920f5

Added constant to control how stringent the core thread limiter is

44bcec4

Fix issue with restarting emulation thread

1d82ed1

The incorrect assumption was made that the plugin would start clean every time.

Fix issue that could cause color buffers to be the wrong size when used

be9aa3d

Fix code formatting issues

ab25fda

Convert const references to pass by value

9c2b012

Reduce buffered drawer buffer size to 4 MB

83bbddb

Always set threading mode

2bca7c9

So that threading mode can be changed on the fly

Added toggle for threaded video to GLideNUI

26ef84d

Fix some issues found by Valgrind

5870e6d

In preparation for memory pool refactor GL commands

4f1bb22

GL commands must now be created using a "get" method

Object pool implementation

c1707e5

Fix issues after rebase to latest master

618d3c7

fzurita force-pushed the threaded_GLideN64 branch from c59c65b to 618d3c7 Compare January 26, 2019 21:20

fzurita mentioned this pull request Feb 23, 2019

Threaded OpenGL API calls #2014

Merged

fzurita closed this Feb 23, 2019
