Use a giant UBO to optimize performance in 2D [OpenGL3] #66861

Merged (1 commit) on Oct 7, 2022

Member

@clayjohn clayjohn commented Oct 4, 2022

This is primarily an optimization and cleanup PR. I've had this idea since I first implemented the 2D renderer and have waited until now to implement it.

Previously I was disappointed in the performance of the old batching method on low-end devices. On high-end devices it was great, but the OpenGL3 renderer is supposed to be the low-end-focused renderer, so I felt I needed to rethink things.

During the Godot sprint, Juan and I discussed some factors that may be slowing down performance disproportionately on the low end. Among them:

  1. performance penalty from relying on gl_InstanceID
  2. performance penalty from having a small number of instances when using glDrawArraysInstanced
  3. performance penalty from having many small buffer uploads

The problem

The old batching renderer worked as follows:

  1. Prepare an empty batch
  2. For every CanvasItem:
    2.a. Set OpenGL state
    2.b. For every canvas item command:
      2.b.i. Check if it can be batched
      2.b.ii. If it can be batched -> add to batch
      2.b.iii. If it can't be batched -> upload current batch data to a new UBO, set OpenGL state if needed, render batch

This worked really well on high-end devices as it allowed the GPU to parallelize UBO buffer uploads and draw commands. On older devices, however, we ended up with a huge performance penalty for drawing right after an upload, and it appeared that each draw command was still executed sequentially.

The end result was that small draws were taking up about 4x as much time as they should. In practice, all batches took at least as much time as a batch with about 10 elements.

The solution

The solution is to record batches in advance, upload all the batches to one UBO, and then issue draw commands from that UBO. To do so, we rely on the fact that the buffer behind a UBO can be as large as we want: as long as we never bind a range larger than the maximum UBO size, we are fine.

The new batching renderer works as follows:

  1. Prepare an empty batch
  2. For every CanvasItem:
    2.a. Init batch data
    2.b. For every canvas item command:
      2.b.i. Check if it can be batched
      2.b.ii. If it can be batched -> add to batch
      2.b.iii. If it can't be batched -> create new batch
  3. For every batch:
    3.a. Bind the range of the UBO needed
    3.b. Set OpenGL state
    3.c. Render batch
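
The two-phase flow above can be sketched in plain C++ without any OpenGL dependency. This is a minimal illustration, not the code from this PR: `Command`, `Batch`, `record_batches`, and `ubo_range` are hypothetical names, and "same texture" stands in for the real can-batch check. The 128-byte instance size and the 512-instance batch limit are the figures quoted under Metrics.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the renderer's types; names are illustrative only.
struct Command {
    uint32_t texture_id; // in this sketch, commands sharing a texture can share a batch
};

struct Batch {
    uint32_t start = 0;          // index of the batch's first instance in the shared array
    uint32_t instance_count = 0; // number of instances recorded in this batch
    uint32_t texture_id = 0;
};

constexpr std::size_t INSTANCE_SIZE = 128; // bytes per instance, as quoted in the PR
constexpr uint32_t MAX_BATCH_SIZE = 512;   // instances per batch, as quoted in the PR

// Phase 1: record every batch up front, without touching the GPU.
std::vector<Batch> record_batches(const std::vector<Command> &commands) {
    std::vector<Batch> batches;
    uint32_t index = 0;
    for (const Command &cmd : commands) {
        const bool can_batch = !batches.empty() &&
                batches.back().texture_id == cmd.texture_id &&
                batches.back().instance_count < MAX_BATCH_SIZE;
        if (!can_batch) {
            // Start a new batch at the current position in the shared array.
            Batch b;
            b.start = index;
            b.texture_id = cmd.texture_id;
            batches.push_back(b);
        }
        batches.back().instance_count++;
        index++;
    }
    return batches;
}

// Phase 2 helper: the byte range of the shared buffer a batch would bind.
struct Range {
    std::size_t offset;
    std::size_t size;
};

Range ubo_range(const Batch &b) {
    return { b.start * INSTANCE_SIZE, b.instance_count * INSTANCE_SIZE };
}
```

In a real renderer, the whole instance array would be uploaded once (for example with glBufferData), and each batch would then bind its range with glBindBufferRange before drawing. OpenGL also requires the bound offset to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which this sketch does not handle.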

This significantly cuts down the cost of uploading the draw data and minimizes the time the draw commands need to wait for the data upload.

Additionally, instead of using instanced drawing to draw our batches, we rely on a dummy element array that is set up to draw 512 quads (4 vertices and 6 indices each). This gets around the performance penalty of using small instance counts with instanced rendering. Small batches now render much faster.
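
As a rough sketch of what such a dummy element array could look like, the following C++ builds the index list for 512 quads. The corner order (0, 1, 2, 2, 3, 0) and the helper names are assumptions made for illustration, as is recovering the quad index by dividing the vertex ID by 4; the PR text only states that the array covers 512 quads with 4 vertices and 6 indices each.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t MAX_QUADS = 512; // quad count quoted in the PR

// Build indices for `quad_count` quads, two triangles per quad.
// The winding/corner order here is an assumed layout, not taken from the PR.
std::vector<uint16_t> build_quad_indices(uint32_t quad_count) {
    static const uint16_t corners[6] = { 0, 1, 2, 2, 3, 0 };
    std::vector<uint16_t> indices;
    indices.reserve(quad_count * 6);
    for (uint32_t q = 0; q < quad_count; q++) {
        const uint32_t base = q * 4; // first vertex of this quad
        for (uint16_t c : corners) {
            indices.push_back(static_cast<uint16_t>(base + c));
        }
    }
    return indices;
}

// Assumed mapping from a vertex ID back to the quad (instance) it belongs to,
// mirroring what a shader could compute from gl_VertexID.
constexpr uint32_t quad_index_from_vertex(uint32_t vertex_id) {
    return vertex_id / 4;
}
```

With a buffer like this, drawing a batch of N quads amounts to drawing the first N * 6 indices with a regular indexed draw call, so no instanced draw is needed. The largest index is 2047, which fits comfortably in 16 bits.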

Metrics

Memory usage:

Previously, memory usage was batch_max_size (512) * instance_size (128 bytes) * total batches in viewport * 3, for each viewport.

Typically in the editor we have a few hundred UBOs in play: ~40 MB

New memory usage is max_instance_count (configurable, defaults to 16384) * instance_size (128 bytes) * 3, for each viewport.

Typically in the editor we have 4 total UBOs: ~8 MB
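
The per-UBO sizes behind these figures can be checked with a little arithmetic. The constant names below are illustrative, and the sketch only uses the numbers quoted above (512 and 16384 instances, 128 bytes per instance, 4 UBOs in the editor).

```cpp
#include <cstddef>

constexpr std::size_t INSTANCE_SIZE = 128;       // bytes per instance
constexpr std::size_t OLD_BATCH_MAX = 512;       // instances per UBO in the old renderer
constexpr std::size_t NEW_MAX_INSTANCES = 16384; // default instances per UBO in the new renderer

// Old renderer: one UBO per batch, each sized for a full batch.
constexpr std::size_t old_ubo_bytes = OLD_BATCH_MAX * INSTANCE_SIZE; // 65536 bytes = 64 KiB

// New renderer: a few large UBOs.
constexpr std::size_t new_ubo_bytes = NEW_MAX_INSTANCES * INSTANCE_SIZE; // 2097152 bytes = 2 MiB
constexpr std::size_t new_total_bytes = 4 * new_ubo_bytes;               // 8388608 bytes = 8 MiB

static_assert(old_ubo_bytes == 64 * 1024, "old UBO is 64 KiB");
static_assert(new_total_bytes == 8 * 1024 * 1024, "four new UBOs are 8 MiB");
```

At 64 KiB per UBO, the old ~40 MB figure corresponds to roughly 640 UBOs, which is in line with "a few hundred", while four 2 MiB UBOs give the ~8 MB quoted for the new renderer.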

Performance

Depending on the device, I measured performance using either RenderDoc or Intel Graphics Analyzer. Accordingly, the absolute values are not necessarily accurate; the relative values, however, should be mostly correct or at least within a reasonable range.

| CPU name | Before | After | Difference |
| --- | --- | --- | --- |
| Ryzen 5 3600 (AMD dedicated graphics) | 3.5ms | 2.4ms | 32% |
| i7-1165G7 (integrated graphics) | 5ms | 3.5ms | 30% |
| i5-8265U | 11.7ms | 7.6ms | 35% |
| Intel® Dual-Core Celeron® N2830 Processor | 33ms | 13ms | 60% |

Fixes: #65977
Fixes: #66463

The future

The first performance issue I identified is not fully solved: we are still using gl_VertexID to read per-instance data. This incurs the same penalty as using gl_InstanceID, because in either case the value is not uniform for the draw call. We can mitigate this in two ways:

  1. Have a special single-item-batch pathway that uses a constant index of 0 instead of calculating the index from gl_VertexID
  2. Pack as much data as possible into flat varyings so that the values are at least uniform for the fragment shader, which is where low-end devices spend most of their time.

uint32_t start = 0;
uint32_t instance_count = 0;

RID tex = RID();
Member

Does RID() self initialize in 4.x?

Member Author

I believe RID tex = RID(); is the same as RID tex; if that is what you are asking

Member

Yeah so the point is that usually we wouldn't explicitly initialize it in the header, like Vector3 or String. But it's not a big deal :)

Member

@lawnjelly lawnjelly left a comment

Looks fine to me. I haven't given it a hugely in-depth look, but I have done some basic testing and it seems to work okay.

Any more reviewers obviously welcome, but I suspect this will be a merge and then continuous improvement / bug fixing.

This removes the countless small UBO writes we had before
and replaces them with a single large write per render pass.

This results in much faster rendering on low-end devices
but improves speed on all devices.
Member Author

clayjohn commented Oct 6, 2022

Just force-pushed an update to resolve merge conflicts. Should be ready to merge now.

@akien-mga akien-mga merged commit 29f0173 into godotengine:master Oct 7, 2022
Member

Thanks!

Successfully merging this pull request may close these issues.

OpenGL: Font rendering quality is worse than in Vulkan
OpenGL: GUI rendering issues when hiding nodes