Use a giant UBO to optimize performance in 2D [OpenGL3] #66861
Conversation
Force-pushed from ae7b8df to fb0f58a
uint32_t start = 0;
uint32_t instance_count = 0;

RID tex = RID();
Does RID() self initialize in 4.x?
I believe RID tex = RID(); is the same as RID tex;, if that is what you are asking.
Yeah, so the point is that usually we wouldn't explicitly initialize it in the header, like Vector3 or String. But it's not a big deal :)
Looks fine to me. I haven't given it a hugely in-depth look, but I have done some basic testing and it seems to work okay.
Any more reviewers obviously welcome, but I suspect this will be a merge and then continuous improvement / bug fixing.
This removes the countless small UBO writes we had before and replaces them with a single large write per render pass. This results in much faster rendering on low-end devices and improves speed on all devices.
Force-pushed from fb0f58a to 154b9c1
Just force pushed an update to resolve merge conflicts. Should be ready to merge now.
Thanks!
This is primarily an optimization and cleanup PR. I've had this idea since I first implemented the 2D renderer and have waited until now to implement it.
Previously I was disappointed in the performance of the old batching method on low-end devices. On high-end devices it was great, but the OpenGL3 renderer is supposed to be the low-end-focused renderer, so I felt like I needed to rethink things.
During the Godot sprint, Juan and I discussed some factors that may be slowing down performance disproportionately on the low end.
The problem
The old batching renderer worked as follows:
2.a. set OpenGL state
2.b. for every canvas item command:
  2.b.i. check if the command can batch
  2.b.ii. if it can batch -> add it to the batch
  2.b.iii. if it can't batch -> upload the current batch data to a new UBO, set OpenGL state if needed, render the batch
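The flow above can be sketched as plain logic, stripped of any actual GL calls. This is an illustrative simulation, not Godot's real code: the batching criterion here is reduced to "same texture", and the counters stand in for the per-batch UBO upload and draw that the old renderer issued inline.

```cpp
#include <cstdint>
#include <vector>

struct Command { uint32_t texture; };

// Hypothetical model of the OLD flow: whenever a command cannot join the
// current batch, the batch is uploaded to a fresh small UBO and drawn
// immediately, so uploads and draws interleave throughout the frame.
struct OldBatcher {
    uint32_t current_tex = UINT32_MAX;
    uint32_t batch_size = 0;
    uint32_t ubo_uploads = 0; // one small UBO write per flushed batch
    uint32_t draw_calls = 0;

    void flush() {
        if (batch_size == 0) return;
        ubo_uploads++; // 2.b.iii: upload current batch data to a new UBO
        draw_calls++;  // ...then render the batch right away
        batch_size = 0;
    }

    void add(const Command &c) {
        if (c.texture != current_tex) { // 2.b.i: can the command batch?
            flush();                    // 2.b.iii: no -> flush, start a new batch
            current_tex = c.texture;
        }
        batch_size++;                   // 2.b.ii: yes -> add to batch
    }
};
```

Feeding this a stream that alternates textures produces one upload and one draw per handful of commands, which is exactly the interleaving that hurts on older GPUs.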
This worked really well on high-end devices, as it allowed the GPU to parallelize UBO buffer uploads and draw commands. On older devices, however, we paid a huge performance penalty for drawing right after an upload, and each draw command still appeared to execute sequentially.
The end result was that small draws were taking up about 4x as much time as they should. In practice, all batches took at least as much time as a batch with about 10 elements.
The solution
The solution is to record batches in advance, upload all the batches to one UBO, and then issue draw commands from that UBO. To do so, we rely on the fact that a UBO can be as large as we want; as long as we only bind at most the maximum UBO binding size at a time, we are fine.
The new batching renderer works as follows:
2.a. init batch data
2.b. for every canvas item command:
  2.b.i. check if the command can batch
  2.b.ii. if it can batch -> add it to the batch
  2.b.iii. if it can't batch -> create a new batch
3.a. bind the range of the UBO needed
3.b. set OpenGL state
3.c. render the batch
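One detail of step 3.a is that glBindBufferRange requires offsets that are multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. A minimal sketch of the layout pass, under assumed values (128-byte instances per the numbers in this PR, a typical 256-byte alignment) and with illustrative names, could look like this:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t INSTANCE_SIZE = 128; // bytes per instance (from this PR)
constexpr uint32_t UBO_ALIGNMENT = 256; // assumed typical driver alignment

struct Batch {
    uint32_t instance_count;
    uint32_t ubo_offset; // byte offset of this batch within the giant UBO
};

// Lay batches out back to back in one buffer, rounding each batch's start up
// to the binding alignment. Returns the total buffer size. A real renderer
// would follow this with a single upload of the whole buffer, then one
// glBindBufferRange(offset, size) per batch before its draw call.
uint32_t layout_batches(std::vector<Batch> &batches) {
    uint32_t offset = 0;
    for (Batch &b : batches) {
        offset = (offset + UBO_ALIGNMENT - 1) / UBO_ALIGNMENT * UBO_ALIGNMENT;
        b.ubo_offset = offset;
        offset += b.instance_count * INSTANCE_SIZE;
    }
    return offset;
}
```

Because the whole buffer is written in one go before any batch is drawn, the GPU never has to stall on a draw that immediately follows its own upload.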
This significantly cuts down on the cost of uploading the draw data as well as minimizes the time the draw commands need to wait for the data upload.
Additionally, instead of using instanced drawing to draw our batches, we rely on a dummy element array set up to draw 512 quads (4 vertices and 6 indices each). This gets around the performance penalty of using small instance counts with instanced rendering; small batches now render much faster.
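The dummy element array described above is straightforward to build: each quad contributes two triangles over its four vertices. A sketch (the function name is illustrative, not Godot's):

```cpp
#include <cstdint>
#include <vector>

// Build the index buffer for `quad_count` quads, 6 indices per quad.
// A batch of N quads is then drawn with a single glDrawElements call over
// N * 6 indices, instead of an instanced draw. With 512 quads the largest
// vertex index is 512 * 4 - 1 = 2047, so 16-bit indices are plenty.
std::vector<uint16_t> build_quad_indices(uint32_t quad_count) {
    std::vector<uint16_t> indices;
    indices.reserve(quad_count * 6);
    for (uint32_t q = 0; q < quad_count; q++) {
        uint16_t base = static_cast<uint16_t>(q * 4);
        // Triangle 1: 0-1-2, triangle 2: 0-2-3 (relative to the quad's base)
        indices.push_back(base + 0);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        indices.push_back(base + 0);
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}
```

The buffer is created once and reused for every batch, so per-draw cost is just the glDrawElements call with the appropriate index count.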
Metrics
Memory usage:
Previously, memory usage was: batch_max_size (512) * instance_size (128 bytes) * total batches in the viewport * 3, for each viewport.
Typically in the editor we had a few hundred UBOs in play: ~40 MB.
New memory usage is: max_instance_count (configurable, defaults to 16384) * instance_size (128 bytes) * 3, for each viewport.
Typically in the editor we now have 4 total UBOs: ~8 MB.
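As a back-of-the-envelope check of those figures (the batch count of 200 below is an assumed stand-in for "a few hundred"; the constants come from the numbers quoted above):

```cpp
#include <cstdint>

constexpr uint64_t INSTANCE_SIZE = 128;       // bytes per instance
constexpr uint64_t OLD_BATCH_MAX = 512;       // instances per old per-batch UBO
constexpr uint64_t NEW_MAX_INSTANCES = 16384; // new default, configurable

// Old scheme: one 512-instance UBO per batch, triple-buffered.
uint64_t old_usage_bytes(uint64_t total_batches) {
    return OLD_BATCH_MAX * INSTANCE_SIZE * total_batches * 3;
}

// New scheme: each large UBO holds up to 16384 instances.
uint64_t new_ubo_bytes() {
    return NEW_MAX_INSTANCES * INSTANCE_SIZE;
}
```

With 200 batches the old scheme lands at about 39.3 MB, matching the ~40 MB figure, while each new UBO is 2 MiB, so 4 of them is the quoted ~8 MB.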
Performance
Depending on the device, I measured performance using either RenderDoc or Intel Graphics Analyzer. Accordingly, the absolute values are not necessarily accurate; the relative values, however, should be mostly correct, or at least within a reasonable range.
Fixes: #65977
Fixes: #66463
The future
The first performance issue I identified is not fully solved: we are still using gl_VertexID to read per-instance data. This incurs the same penalty as using gl_InstanceID, since in either case the value is not uniform for the draw call. We can mitigate this in two ways:
0
instead of calculating the index from gl_VertexID