Use instanced array buffer instead of UBO for canvas item batching #70065

clayjohn · 2022-12-14T17:26:58Z

This simplifies the generated shader code which increases both performance and compile time on low end devices. However, the main purpose of this PR is to get OpenGL rendering on Apple devices again.

Fixes: #68760

May fix: #68980
May fix: #68819
May fix: #67595

Background

This started as a search for a workaround to the issues we were facing on Apple devices, but as it turns out, this new approach can also be more efficient on some devices, especially low end devices.

Basically what this PR does is switch from using a giant Uniform Buffer Object to using a giant Vertex Buffer Object. Instead of dynamically indexing into an array of uniforms (using instance ID is the index) in the shader, we push instance rate attributes to the shader. This improves codegen as there is no dynamic indexing.

For Uniform Buffers we had the ability to bind a specific range, for Vertex Buffers we don't, but we can specify an offset while binding the buffer. Unfortunately, this means we need to rebind the buffer for every batch. In my testing this doesn't carry a noticeable performance penalty.

By way of background, the approach we were using is often referred to as "vertex pulling" i.e. we are using an index to pull vertex data from some buffer. This contrasts with "vertex pushing" which is when we use attributes to push vertex data to a shader (so the shader only sees the data for the current vertex). On modern desktop hardware vertex pulling can be way more efficient as it often allows you to get by with much less data.

In our case, since we only need information on a per-vertex or per-instance basis, the built in tools in OpenGL for vertex attributes and instance attributes work just fine. There is an overhead to instancing, however, in my testing it appears the benefits outweigh the costs.

Performance analysis

I carried out the best analysis I could of rendering performance. For the most part this means that I profiled GPU operations using the best tools available for the platform. On macOS I was limited to printing FPS which is understandably a poor way of measuring GPU performance.

All measurements were taken with the Project Manager

Platform	Device	4.0-dev	This PR	Notes
Windows	RX 6900XT	1.1-1.4ms	1.0ms	Renderdoc
PopOS 22.04	i7-1165G7 (Xe graphics)	1.6ms	1.4ms	Renderdoc
Windows	i5-8265U 1.6GHz (intel HD 620)	1.7-1.9ms	1.5-1.6ms	Renderdoc
Windows	i5-8265U 1.6GHz (intel HD 620)	0.8ms	0.5ms	Intel Graphics Monitor
macOS 12.6.1	2.7 GHz Dual core intel i5	N/A	6-10ms*	`--print-fps`
Android 13	Pixel 4	5.3ms	5.2ms	Android GPU Inspector
Android 13	Pixel 7	4.7ms	4.4ms	Android GPU Inspector
Lubuntu 20.04	Intel Celeron N2840 2.16GHz**	6.5ms	4.4ms	Renderdoc

On macOS I forcibly disabled low processor mode and vsync However, It appears some form of vsync was still occurring, as the times it was at 6ish ms it appeared locked at 144 FPS
** This device is a 2014 Chromebook with ChromeOS wiped

Given these results, I feel comfortable saying that this PR will run at least as fast as the previous approach and may even run faster on some devices.

Further work

I quickly tried out two improvements that are promising but did not show a performance improvement on my main development device (there may be improvements on other devices though). Noting the approaches here for future reference:

Add a "single instance" mode to the shader so batches with a single instance avoid utilizing the vertex buffer and instead pass their per-instance data in a single uniform. In theory this should be faster as it still avoids dynamic indexing and reduces vertex pressure/bandwidth, however in testing on my main device this resulted in a marginal decrease in performance (perhaps due to all the context switching) (https://github.com/clayjohn/godot/tree/GLES3-SI) Needs testing on bandwidth restricted devices (i.e. mobile)
Use non-instanced variations of draw functions when instance count is 1 (i.e. glDrawArrays instead of glDrawArraysInstanced). Doing so resulted in no performance benefit. This may be because the drivers on my development device optimize for this situation already. This should be tested on a wider variety of hardware.

qarmin · 2022-12-15T10:07:48Z

Tested that on Linux fixed grey screen on Intel HD 3000

This simplifies the generated shader code which increases both performance and compile time on low end devices

clayjohn · 2022-12-15T16:26:17Z

Rebased to fix CI

akien-mga

Sounds like the right move! Didn't review the code in depth but the best review will be testing in beta 9 :)

akien-mga · 2022-12-15T17:45:13Z

Thanks!

lawnjelly · 2022-12-15T17:55:57Z

I had a little look through this morning and it looked fine from what I saw. 👍

clayjohn added bug enhancement platform:macos topic:rendering high priority topic:2d labels Dec 14, 2022

clayjohn added this to the 4.0 milestone Dec 14, 2022

clayjohn requested a review from a team as a code owner December 14, 2022 17:27

Use instanced array buffer instead of UBO for canvas item batching

b6a1aa1

This simplifies the generated shader code which increases both performance and compile time on low end devices

clayjohn force-pushed the GLES3-attribs branch from a975436 to b6a1aa1 Compare December 15, 2022 16:25

akien-mga approved these changes Dec 15, 2022

View reviewed changes

akien-mga merged commit 47ef054 into godotengine:master Dec 15, 2022

clayjohn deleted the GLES3-attribs branch December 15, 2022 18:01

clayjohn mentioned this pull request Jan 2, 2023

HTML5 export for Godot 4.x takes 1-2 minutes to load on macOS #70691

Open

clayjohn mentioned this pull request Jan 29, 2023

Remove cap on number of items drawn in frame in 2D gl_compatibility renderer #72291

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use instanced array buffer instead of UBO for canvas item batching #70065

Use instanced array buffer instead of UBO for canvas item batching #70065

clayjohn commented Dec 14, 2022

qarmin commented Dec 15, 2022

clayjohn commented Dec 15, 2022

akien-mga left a comment

akien-mga commented Dec 15, 2022

lawnjelly commented Dec 15, 2022

Use instanced array buffer instead of UBO for canvas item batching #70065

Use instanced array buffer instead of UBO for canvas item batching #70065

Conversation

clayjohn commented Dec 14, 2022

Background

Performance analysis

Further work

qarmin commented Dec 15, 2022

clayjohn commented Dec 15, 2022

akien-mga left a comment

Choose a reason for hiding this comment

akien-mga commented Dec 15, 2022

lawnjelly commented Dec 15, 2022