-
-
Notifications
You must be signed in to change notification settings - Fork 35.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WebGLRenderer: Tight morph target packing. #27768
base: dev
Are you sure you want to change the base?
Conversation
📦 Bundle sizeFull ESM build, minified and gzipped.
🌳 Bundle size after tree-shakingMinimal build including a renderer, camera, empty scene, and dependencies.
|
It seems the |
One example that demonstrates instanced morph targets is sufficient. Do you mind modifying |
int y = texelIndex / morphTargetsTextureSize.x; | ||
int x = texelIndex - y * morphTargetsTextureSize.x; | ||
|
||
ivec3 morphUV = ivec3( x, y, morphTargetIndex ); | ||
return texelFetch( morphTargetsTexture, morphUV, 0 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are multiple fetches really faster than a single one? I'm afraid this needs to be tested one more than one device. Especially mobile devices could show a performance regression here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is currently with rgba format it will always fetch 4 floats so fetching 3 instead should always be faster especially on mobiles where bandwidth is precious. I'll replace the horse example with the faces one and do some more testing but I only have one android phone with an older snapdragon and the quest 3 which all show similar improvements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you could also add a slider to webgl_instancing_morph
that controls the number of instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have the faces instead because they are using more influences and also morph the normals so its a more complete case than the horses.
I saw that it's failing when checking the screenshot with "Diff wrong in 0.4% of pixels in file: webgl_loader_mmd" but this is irrelevant to any of the files in this PR. I've seen it happening in the past and usually triggering the tests again with a commit fixes it but I've done it once and it didn't. I've checked the files on my repo and they look identical on the eye, maybe its a transparency issue? I'm a bit lost on this one, any help would be appreciated |
@@ -20,12 +20,34 @@ export default /* glsl */` | |||
|
|||
vec4 getMorph( const in int vertexIndex, const in int morphTargetIndex, const in int offset ) { | |||
|
|||
int texelIndex = vertexIndex * MORPHTARGETS_TEXTURE_STRIDE + offset; | |||
int texelIndex = vertexIndex * MORPHTARGETS_TEXTURE_STRIDE + 3 * offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to multiply the offset
with 3
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep in line with what we previously had on the shader chunks that call getMorph. Now the position is at offset 0, normals at offset 3 and colors at offset 6. Alternatively we can modify the other shader chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid any misunderstanding, you want me to change the other shader chunks to eliminate the multiplication with 3? At some point we will probably have to revisit that part anyway if/when we add additional morphing attributes like uv. When that time comes more extensive changes on how we handle morphing will be needed because now it's a bit fixed like if there's a color morph but not normals the latter will be empty space in the texture. This is how it used to be even before this PR but we can make it more flexible when more attributes are introduced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the current version of this PR 👍 .
@Mugen87 that artifact was caused by an incorrect calculation of the texture width when it exceeded the hardware maximum. Its fixed now, I did quite a bit of testing and it seems solid. I've also added some sliders on the example to tweak the instance count and set it up to start conservative so its much more useful now for testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good now! The only issue is that performance testing isn't really possible since there is no example that allows to compare the existing and new approach. Especially since the reference example has been updated with this PR. Given the advantage in bandwidth cost I still think it's worth to given this new approach a try.
@Mugen87 I've set up the example with the current dev to compare |
On a Pixel 4a with latest Chrome I see this with the default grid settings: Current Dev: 23 FPS On an iPad (8.Gen) with Safari with max grid settings: Current Dev: 24 FPS So I can't confirm based on both devices that the new implementation is in general faster. Fetching a single RGBA texel seems to be faster than fetching 3 Red texels on the above devices. |
Performance regressions on mobile devices.
Sry, I have to revert my approval. When can't merge the PR if we see performance degradations. This needs more testing especially since most devices outside there are mobile. |
No worries that's why I set up the examples. At least on my phone with an adreno 640 I see consistent improvement on all grid size eg on 3 x 3 grid it's 34 vs 31 fps I believe you run into thermal throttling issues but yeah we can keep it in testing for a bit. If you want we can at least include this example instead of the horse one as its way more useful for testing |
That would be great! |
@Mugen87 @mrdoob yeah your results seem to be consistent with what I get as well but I made a shocking discovery. I tried the examples on a Quest 3 where current dev gives 40 fps while this PR gives almost 90 with the default grid. This can't be attributed to bandwidth savings alone there seems to be an issue with fetching rgba floats there and if that's the case it probably affects skinned meshes as well as the bone matrices are also stored as rgba. @cabanier can you verify? I will investigate further |
@Mugen87 I did a lot of testing including stress testing skinned meshes and it seems that the quest exhibits this behaviour only when used with instanced morph targets with the current dev. The only thing that comes to my mind is that because we are fetching the weights from a single channel texture and then the morph attributes from rgba ones, this triggers some issue in that platform/driver. With this PR where both textures are single channel it seems that it doesn't In any case, on all my devices I can only see improvement and I think @mrdoob also has a similar experience from what he posted. How should I handle it? Should I clean this PR and leave only the example, set up the cases live and open an issue for people to test. Or should I leave this PR open and somehow invite people to share here? Please advise |
Let's do this. It would be good if a few more people could test both links (ideally with mobile devices) and report performance data. If the majority of devices show an improvement, I'm okay with approving the PR. |
That is quite the improvement!
I'm out of town this week with no access to a device. Can you describe the change you made? I can forward to our team that works on the low level graphics stack to get their opinion. |
The context is that we are instancing meshes that do morph target animations. Each instance has its own set of weights for the morph targets. In the current dev we fetch some floats from a single channel texture and fill an array with the weights for the morph targets, then we iterate that array and if an element is > 0 we fetch one or more rgba texels from an array texture to get the attributes for the morphing(position/normal/color). shaderchunk In this PR we still fetch the weights as before but we convert the attribute array texture to single channel as well and fetch the xyz components for position/normal individually to save bandwidth because in the rgba case we have one component which is always set to zero. shaderchunk I tested the examples on the Quest 3 and the speedup(>100%) is way more than what we could theoretically achieve which leads me to believe that the setup of the current dev causes some issue with the stack/shader compilation which degrades performance You can use the links in this comment for reference #27768 (comment) |
light.castShadow = true; | ||
const renderer = new THREE.WebGLRenderer( { | ||
antialias: true, | ||
powerPreference: 'high-performance' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to set this to 'high-performance'?
We're still looking into why this speedup might happen but someone noticed that you added |
I tried removing it on both the current dev and this PR, I even set it to "low-power" but it didn't make a difference. From my understanding this hint applies to dual gpu setups and instructs the browser to select the better one for the webgl context. Like eg in laptops where you have an intel alongside an nvidia. You use it in your stack to enable optimizations? From my limited knowledge on graphic stacks, I suspect that the slowdown in the current dev happens because different texture formats are used and the stack has to do some reconfiguration on the fly to switch the indexing from one format to the other and this is expensive. But again my knowledge is limited. |
FYI, we're still looking into why this is making such a huge difference. We are suspecting that this could have a more efficient caching strategy but haven't proved it yet. |
FYI I implemented this PR in the WebGPURenderer and did run some test in both WebGPU and WebGL Backends and using Tight packing via Red Channel seems to increase by 5 to 10% the GPU usage. |
This is for apple silicon from what I can see in your PR, there are probably alignment issues there but other architectures behave differently. On intel/nvidia/amd and some mobile chipsets we see a good speedup. On the Q3 specifically, and maybe snapdragons in general, we see a major regression when we mix single and multi channel float textures, as the current dev does, in the vertex shader. I'm still waiting for some insight by meta and @rcabanier on this |
While optimizing morph target performance for this demo I noticed that positions and normals carry an unused component.
This PR changes the morph target texture from rgba -> single float to enable fetching only the xyz components for positions and normals while for color it still stores and fetches the full rgba set.
It is functionally identical as before but gives 25% better fps (23 vs 18) on my laptops 3060 in this example which I also include in the PR. If you don't want it I can take it out but it would be nice to have something to stress test morph targets.