Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threaded OpenGL API calls #1452

Closed
wants to merge 18 commits into from
Closed

Conversation

fzurita
Copy link
Contributor

@fzurita fzurita commented Apr 2, 2017

This is the ongoing work to get a threaded implementation of the OpenGL API going.

This is not threaded yet, but it's the refactoring I have had to do so far that has been needed to support that.

Currently this does seem to have a performance regression, I have to figure out where it's coming from.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 2, 2017

The performance penalty comes from the unbuffered drawer. Probably because I can't cache the vertices any more.

I think I know how I will fix this. I will use the old version of the unbuffered drawer when in single threaded mode. I will use the current version in threaded mode or the buffered drawer if the hardware allows.

Edit:

Ok, done. I made a "thread safe" version of the UnbufferedDrawer. The plugin runs about ~25% slower using that version. This will be used in threaded mode whenever the hardware doesn't support Buffer Storage extension. I can't think of a clean way to fix that for devices without the extension, hopefully the threading makes up for it for those devices.

Edit2:
I was able to come up with an optimization that is thread safe and gives me 95% of the performance of the non thread safe way of doing it.

@fzurita fzurita force-pushed the threaded_GLideN64 branch 4 times, most recently from 942cf85 to 9eca4bf Compare April 4, 2017 04:14
@fzurita
Copy link
Contributor Author

fzurita commented Apr 4, 2017

Ok, I enabled threaded mode and of course it didn't work right off the bat.

I'm getting INVALID_OPERATION on glMapBufferRangeCommand. This doesn't happen when not in threaded mode with the same arguments. Here are the arguments:

target=8892 offset=0 length=4194304 access=c2

Anyone have any ideas?

@gonetz
Copy link
Owner

gonetz commented Apr 4, 2017

https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMapBufferRange.xhtml

There are many cases for INVALID_OPERATION. I'd check "if zero is bound to target" and "is already in a mapped state" first.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 4, 2017

I found at least that problem. There were instances of the old unwrapped gl calls in the code still left. GET_GL_FUNCTION to be more precise. Now I just need to figure out how to fix those.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 4, 2017

Ok, good news, it works! For some reason the FPS display is not working so I can't judge performance. And for some reason the thread is not exiting when core exits.

Also, I expect performance to be identical at first since there are many operations that we are doing that are stalling the pipeline. So until I fix those, I don't expect any improvements.

Edit: I think a good solution for now would be if there is more than 1 swap buffers call in the queue, then prevent the main thread from executing until the video plugin catches up. I think that will work pretty well.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 5, 2017

Ok, after performing several optimizations on the threading. I'm seeing about a 25% performance improvement in FPS in CBFD. I'm thinking that the percentage improvement in FPS is going to correspond to how much time the emulation spends in the core.

@AmbientMalice
Copy link
Contributor

I'm seeing about a 25% performance improvement in FPS in CBFD.

Is this on desktop or mobile, and a 25% improvement on what baseline? (Unthreaded, or unoptimized threaded?)

@fzurita
Copy link
Contributor Author

fzurita commented Apr 5, 2017

This is in mobile. Unthreaded vs threaded. Threaded has 25% better performance in that game.

@loganmc10
Copy link
Contributor

Let me know when you feel it's ready for testing (and how to turn it on and off). I suspect it might be driver dependant. I think Nvidia has threaded optimizations built into the driver, so this might just add extra overhead on top of that, but testing will reveal that.

@AmbientMalice
Copy link
Contributor

Isn't AMD notoriously poor at threading? This might help reduce CPU overheads on AMD.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 5, 2017

Right, I don't think it's ready yet. There are a few bugs I need to work out.

For example, in Mario 64, the core runs so fast when fast forwarding, that the OpenGL command queue keeps growing indefinitely. I need to figure out a way to drop non essential commands and skip frames.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 5, 2017

This will only help improve performance by allowing the GPU to keep doing work while the video plugin is not executing. So if the emulator spends 80% of the time in the video plugin, we can get an additional 20% performance.

Also, this will not help at all if we have to do read backs from video memory to CPU memory in sync mode. For example, if we need to read the color buffer in sync. If we read the color buffer in async mode though, this will help.

@fzurita fzurita force-pushed the threaded_GLideN64 branch 3 times, most recently from 1d51917 to c55b2e2 Compare April 5, 2017 06:41
@loganmc10
Copy link
Contributor

For example, in Mario 64, the core runs so fast when fast forwarding, that the OpenGL command queue keeps growing indefinitely. I need to figure out a way to drop non essential commands and skip frames

Isn't this essentially the problem with running the OpenGL calls in their own thread? Won't this de-sync the video from the rest of the emulator? I'm still having a hard time understanding how this would improve performance without causing other issues with the emulation

@@ -60,7 +62,7 @@ void ContextImpl::init()
}

{
if ((m_glInfo.isGLESX && (m_glInfo.bufferStorage && m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32)) || !m_glInfo.isGLESX)
if ((m_glInfo.isGLESX && m_glInfo.bufferStorage) || !m_glInfo.isGLESX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was m_glInfo.majorVersion * 10 + m_glInfo.minorVersion > 32 removed?

BufferedDrawer depends on glDrawElementsBaseVertex, which is only available in GLES 3.2+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I should put that back.

}
FunctionWrapper::glLineWidth(_width);
FunctionWrapper::glDrawArrays(GL_LINES, 0, 2);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a newline here

@@ -26,4 +26,4 @@ namespace opengl {
std::array<const void*, MaxAttribIndex> m_attribsData;
};

}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a newline here

namespace opengl {

std::array<std::shared_ptr<std::vector<char>>, MaxAttribIndex> GlVertexAttribPointerUnbufferedCommand::m_attribsData;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a newline here

::CoreVideo_GL_SwapBuffers();
}
};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a newline here

{
executeCommand(std::make_shared<GlTextureSubImage2DUnbufferedCommand<pixelType>>(texture, level, xoffset, yoffset, width, height, format, type, std::move(pixels)));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline

@fzurita
Copy link
Contributor Author

fzurita commented Apr 5, 2017

By the way, this should allow us to implement frame skipping based on command queue size. So if there is more than one "swap buffers" command in the queue. We should be able to drop some commands to catch up. I just need to figure out what commands are safe to drop.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 6, 2017

Ok, I'm making a lot of head way. I fixed numerous bugs. Also, before swapping buffers, we will check if there is already a queued swap buffer ahead in the queue. If there is, we will wait until that is executed before continuing.

This prevents the situation where the GL command queue gets too far behind the emulation core. In the long run though, I think we want to skip frames.

Here are a few more things I need to do to call this complete:

  • Add project 64 support (I may need some help with this since I'm not familiar with the API)
  • Make this setting configurable
  • Figure out why single threaded performance is poor in some situations. I can always fallback to some of the old code for this. (I may have been measuring wrong, I can't find a performance reduction now)
  • Add async glReadPixels support, right now all glReadPixels calls will for the command queue to empty, eliminating the benefit of threading.
  • If BufferedDrawer is not supported, don't use UnbufferedDrawer while using theads, instead use UnbufferedDrawerThreadSafe
  • Take care of any comments or additional issues that show up.
  • Optimize UnbufferedDrawerThreadSafe. It's way too slow, which causes devices that don't sorry the buffered drawer to actually run sorry in threaded mode.

I may add to this as I see problems.

@fzurita
Copy link
Contributor Author

fzurita commented Apr 6, 2017

@loganmc10 Can you give this a test? You can enable this by setting "ThreadedVideo" to true in mupen64plus.

Edit: Also, keep in mind that this only works mupen64plus and I have yet to update the CMake file.

@dankcushions
Copy link
Contributor

Here goes! Now, for whatever reason I've never been able to get VerticalSync = False to work on my raspberry - it always seems like it's synced (never goes above ~60), so it's difficult to tell performance gains, but I can usually check regressions. Current Master results first, ThreadedVideo = True results in bold

Mario 64
during mario's face intro
59VIS 57-61
29FPS 25-30
(about the same)

idling outside castle
57-61VIS 58-61
28-31FPS 28-31
(about the same)

inside front door
57-61VIS 58-61
28-31FPS 27-31
(about the same)

idling at start of bob-omb battlefield
42-45VIS 44-53
22-25FPS 23-25
(slightly better?)

Goldeneye 64
(curiously in both, all wall and floor textures were unfiltered)

during bond's gun barrel walk.
29VIS 28
28FPS 28
(about the same)

dam intro pan
60 VIS 60
25 FPS 27
(about the same)

idling at start of dam
44-51VIS 25-63
22-27FPS 22-31
(slightly better?)

I'd say there's no regressions and it seems like a slight improvement, but I feel the benchmarking I am able to do is not definitive :( I wonder if @psyke83 or @gizmo98 can think of a better way?

@fzurita
Copy link
Contributor Author

fzurita commented Dec 24, 2017

I'm glad to hear that at least it's not a performance penalty. I think implementing the object pool helped with that.

I think this code really helps when Async color to RDRAM is enabled which the RPI doesn't have.

@fzurita
Copy link
Contributor Author

fzurita commented Jan 26, 2019

Rebased against latest master.

Tested a little bit and the performance boost seems to still be there, specially with Async Color buffer to RDRAM enabled. Around a 10% performance improvement on my device, but it will depend on how fast color buffer copies are.

@fzurita fzurita force-pushed the threaded_GLideN64 branch 2 times, most recently from d28eb0c to c59c65b Compare January 26, 2019 21:17
@gonetz
Copy link
Owner

gonetz commented Jan 28, 2019

Merged to fzurita-threaded_GLideN64 branch.

@fzurita fzurita closed this Feb 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants