Use a giant UBO to optimize performance in 2D [OpenGL3] #66861
Conversation
Force-pushed from ae7b8df to fb0f58a
uint32_t start = 0;
uint32_t instance_count = 0;

RID tex = RID();
Does RID() self initialize in 4.x?
I believe RID tex = RID(); is the same as RID tex;, if that is what you are asking.
Yeah, so the point is that usually we wouldn't explicitly initialize it in the header, like Vector3 or String. But it's not a big deal :)
Looks fine to me. I haven't given it a hugely in-depth look, but I have done some basic testing and it seems to work okay.
Any more reviewers obviously welcome, but I suspect this will be a merge and then continuous improvement / bug fixing.
This removes the countless small UBO writes we had before and replaces them with a single large write per render pass. This results in much faster rendering on low-end devices and improves speed on all devices.
Force-pushed from fb0f58a to 154b9c1
Just force pushed an update to resolve merge conflicts. Should be ready to merge now.
Thanks!
This is primarily an optimization and cleanup PR. I've had this idea since I first implemented the 2D renderer and have waited until now to implement it.
Previously I was disappointed in the performance of the old batching method on low-end devices. On high-end devices it was great, but the OpenGL3 renderer is supposed to be the low-end-focused renderer, so I felt like I needed to rethink things.
During the Godot sprint, Juan and I discussed some factors that may be slowing down performance disproportionately on the low end.
The problem
The old batching renderer worked as follows:
2.a. set OpenGL state
2.b. for every canvas item command:
  2.b.i. check if the command can batch
  2.b.ii. if it can batch -> add it to the batch
  2.b.iii. if it can't batch -> upload the current batch data to a new UBO, set OpenGL state if needed, render the batch
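The flow above can be sketched as plain logic, stripped of any actual GL calls. This is an illustrative simulation, not Godot's real code: the batching criterion here is reduced to "same texture", and the counters stand in for the per-batch UBO upload and draw that the old renderer issued inline.

```cpp
#include <cstdint>
#include <vector>

struct Command { uint32_t texture; };

// Hypothetical model of the OLD flow: whenever a command cannot join the
// current batch, the batch is uploaded to a fresh small UBO and drawn
// immediately, so uploads and draws interleave throughout the frame.
struct OldBatcher {
    uint32_t current_tex = UINT32_MAX;
    uint32_t batch_size = 0;
    uint32_t ubo_uploads = 0; // one small UBO write per flushed batch
    uint32_t draw_calls = 0;

    void flush() {
        if (batch_size == 0) return;
        ubo_uploads++; // 2.b.iii: upload current batch data to a new UBO
        draw_calls++;  // ...then render the batch right away
        batch_size = 0;
    }

    void add(const Command &c) {
        if (c.texture != current_tex) { // 2.b.i: can the command batch?
            flush();                    // 2.b.iii: no -> flush, start a new batch
            current_tex = c.texture;
        }
        batch_size++;                   // 2.b.ii: yes -> add to batch
    }
};
```

Feeding this a stream that alternates textures produces one upload and one draw per handful of commands, which is exactly the interleaving that hurts on older GPUs.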
This worked really well on high-end devices, as it allowed the GPU to parallelize UBO buffer uploads and draw commands. On older devices, however, we paid a huge performance penalty for drawing right after an upload, and each draw command still appeared to execute sequentially.
The end result was that small draws were taking up about 4x as much time as they should. In practice, all batches took at least as much time as a batch with about 10 elements.
The solution
The solution is to record batches in advance, upload all the batches to one UBO, and then issue draw commands from that UBO. To do so, we rely on the fact that a UBO can be as large as we want; as long as we only bind at most the maximum UBO binding size at a time, we are fine.
The new batching renderer works as follows:
2.a. init batch data
2.b. for every canvas item command:
  2.b.i. check if the command can batch
  2.b.ii. if it can batch -> add it to the batch
  2.b.iii. if it can't batch -> create a new batch
3.a. bind the range of the UBO needed
3.b. set OpenGL state
3.c. render the batch
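One detail of step 3.a is that glBindBufferRange requires offsets that are multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. A minimal sketch of the layout pass, under assumed values (128-byte instances per the numbers in this PR, a typical 256-byte alignment) and with illustrative names, could look like this:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t INSTANCE_SIZE = 128; // bytes per instance (from this PR)
constexpr uint32_t UBO_ALIGNMENT = 256; // assumed typical driver alignment

struct Batch {
    uint32_t instance_count;
    uint32_t ubo_offset; // byte offset of this batch within the giant UBO
};

// Lay batches out back to back in one buffer, rounding each batch's start up
// to the binding alignment. Returns the total buffer size. A real renderer
// would follow this with a single upload of the whole buffer, then one
// glBindBufferRange(offset, size) per batch before its draw call.
uint32_t layout_batches(std::vector<Batch> &batches) {
    uint32_t offset = 0;
    for (Batch &b : batches) {
        offset = (offset + UBO_ALIGNMENT - 1) / UBO_ALIGNMENT * UBO_ALIGNMENT;
        b.ubo_offset = offset;
        offset += b.instance_count * INSTANCE_SIZE;
    }
    return offset;
}
```

Because the whole buffer is written in one go before any batch is drawn, the GPU never has to stall on a draw that immediately follows its own upload.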
This significantly cuts down on the cost of uploading the draw data as well as minimizes the time the draw commands need to wait for the data upload.
Additionally, instead of using instanced drawing to draw our batches, we rely on a dummy element array set up to draw 512 quads (4 vertices and 6 indices each). This gets around the performance penalty of using small instance counts with instanced rendering; small batches now render much faster.
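The dummy element array described above is straightforward to build: each quad contributes two triangles over its four vertices. A sketch (the function name is illustrative, not Godot's):

```cpp
#include <cstdint>
#include <vector>

// Build the index buffer for `quad_count` quads, 6 indices per quad.
// A batch of N quads is then drawn with a single glDrawElements call over
// N * 6 indices, instead of an instanced draw. With 512 quads the largest
// vertex index is 512 * 4 - 1 = 2047, so 16-bit indices are plenty.
std::vector<uint16_t> build_quad_indices(uint32_t quad_count) {
    std::vector<uint16_t> indices;
    indices.reserve(quad_count * 6);
    for (uint32_t q = 0; q < quad_count; q++) {
        uint16_t base = static_cast<uint16_t>(q * 4);
        // Triangle 1: 0-1-2, triangle 2: 0-2-3 (relative to the quad's base)
        indices.push_back(base + 0);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        indices.push_back(base + 0);
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}
```

The buffer is created once and reused for every batch, so per-draw cost is just the glDrawElements call with the appropriate index count.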
Metrics
Memory usage:
Previously, memory usage was: batch_max_size (512) * instance_size (128 bytes) * total batches in the viewport * 3, for each viewport.
Typically in the editor we had a few hundred UBOs in play: ~40 MB.
New memory usage is: max_instance_count (configurable, defaults to 16384) * instance_size (128 bytes) * 3, for each viewport.
Typically in the editor we now have 4 total UBOs: ~8 MB.
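As a back-of-the-envelope check of those figures (the batch count of 200 below is an assumed stand-in for "a few hundred"; the constants come from the numbers quoted above):

```cpp
#include <cstdint>

constexpr uint64_t INSTANCE_SIZE = 128;       // bytes per instance
constexpr uint64_t OLD_BATCH_MAX = 512;       // instances per old per-batch UBO
constexpr uint64_t NEW_MAX_INSTANCES = 16384; // new default, configurable

// Old scheme: one 512-instance UBO per batch, triple-buffered.
uint64_t old_usage_bytes(uint64_t total_batches) {
    return OLD_BATCH_MAX * INSTANCE_SIZE * total_batches * 3;
}

// New scheme: each large UBO holds up to 16384 instances.
uint64_t new_ubo_bytes() {
    return NEW_MAX_INSTANCES * INSTANCE_SIZE;
}
```

With 200 batches the old scheme lands at about 39.3 MB, matching the ~40 MB figure, while each new UBO is 2 MiB, so 4 of them is the quoted ~8 MB.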
Performance
Depending on the device, I measured performance using either RenderDoc or Intel Graphics Analyzer. Accordingly, the absolute values are not necessarily accurate; the relative values, however, should be mostly correct, or at least within a reasonable range.
Fixes: #65977
Fixes: #66463
The future
The first performance issue I identified is not fully solved: we are still using gl_VertexID to read per-instance data. This incurs the same penalty as using gl_InstanceID, since in either case the value is not uniform for the draw call. We can mitigate this in two ways:
0
instead of calculating the index from gl_VertexID