
Emulate double precision for regular rendering operation when REAL_T_IS_DOUBLE #66178

Merged 1 commit into godotengine:master on Sep 30, 2022

Conversation

@clayjohn (Member) commented Sep 20, 2022

Fixes: #58516
Finishes @fire's 4-year, 10-month, 30-day journey of double-precision support: #12299

We calculate the lost precision on the CPU and pass it into the GPU so that it can calculate an error-corrected version of the vertex position. The general approach is detailed here: http://andrewthall.org/papers/df64_qf128.pdf
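
To make the idea concrete, here is a minimal standalone sketch. This is plain C++ rather than actual engine or shader code, and split_double, two_sum and compensated_diff are illustrative names; the float math here stands in for what the shader does. The CPU splits each double coordinate into a high float plus the residual that a single float cannot hold, and the GPU-side subtraction recombines the two pairs with an error-compensated sum, so the camera-relative position keeps its sub-unit detail even a billion units from the origin:

// Illustrative only, not Godot's implementation. Compile without -ffast-math,
// otherwise the compiler may optimize the compensation away.
#include <cstdio>

// CPU side: split a double coordinate into a float plus the residual
// ("lost precision") that does not fit into a single float.
void split_double(double v, float &hi, float &lo) {
    hi = (float)v;                // nearest representable float
    lo = (float)(v - (double)hi); // what the rounding threw away
}

// Knuth's two-sum: a + b = s + err, with the rounding error recovered exactly.
float two_sum(float a, float b, float &err) {
    float s = a + b;
    float v = s - a;
    err = (a - (s - v)) + (b - v);
    return s;
}

// "GPU side" (emulated here in float): error-compensated difference of two
// (hi, lo) pairs, folding the final error term into the result.
float compensated_diff(float a_hi, float a_lo, float b_hi, float b_lo) {
    float err;
    float s = two_sum(a_hi, -b_hi, err);
    return s + (a_lo - b_lo + err);
}

int main() {
    const double vertex_x = 1000000000.25; // one billion units from the origin
    const double camera_x = 1000000000.0;

    float v_hi, v_lo, c_hi, c_lo;
    split_double(vertex_x, v_hi, v_lo);
    split_double(camera_x, c_hi, c_lo);

    float naive = (float)vertex_x - (float)camera_x;        // the 0.25 is lost
    float fixed = compensated_diff(v_hi, v_lo, c_hi, c_lo); // the 0.25 survives
    printf("naive: %f, compensated: %f\n", naive, fixed);
    return 0;
}

The naive single-float difference prints 0.0, while the compensated version recovers the 0.25 that was split off on the CPU.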

This allows rendering very large worlds as if we were using double precision in the shader when we are not.

We don't use doubles in the shader for two reasons:

  1. Limited support for doubles. The Metal shading language doesn't support them at all, and on some devices using doubles in shaders causes a crash (that's the case on my Intel integrated GPU).
  2. Even when GPUs do support doubles, performance can be poor

This method works for "normal" rendering only, meaning it does not work with the render modes skip_vertex_transform or world_vertex_coords; in either case you end up doing the calculations entirely in single-precision floats and lose the benefit of this approach. By the same token, any shader operations done in world space will still be a problem, as the shaders are 100% floats. This means that world triplanar is still limited by the bounds of single-precision floats.

I have not implemented this yet in GLES3 as I want to flesh out the 3D renderer a bit more first. Particularly, I want to ensure that this won't conflict with our floating point precision needs.

Lastly, as it is written, this approach does not work with particles/multimeshes. The same general approach can be used, but when applied to particles it will result in many times more calculations. Right now, for smaller ranges (under 500 km) this isn't a big deal, but once you approach 1,000 km the error becomes noticeable, and above 10,000 km the error is significant. I can add the relevant code for particles/multimeshes if desired.
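
(For a rough sense of scale, assuming 1 unit = 1 meter: the spacing between adjacent single-precision floats is roughly 3 cm at 500 km from the origin, 6 cm at 1,000 km, and a full 1 m at 10,000 km, so anything that has to round to a plain float at those distances can be off by up to half that spacing at every step.)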

Edit: I was wrong above. I thought that we would need to add a high-precision path to the instance_transform * model matrix multiplication, which would require decomposing the multiplication into a mat3 x mat3 multiplication and a high-precision dot product (a ton of calculations). However, I realized that when using the normal render path, we can keep the instance_transform separate and add its origin offset at the same time we do the model/view multiplication. So I've added a code path that ensures the full model matrix is available if MODEL_MATRIX is read in the shader, if using world vertex coords, if skipping vertex transform, or if not using doubles. Finally, I made MODEL_MATRIX read-only in shaders. Previously you could write to it and the value would be totally ignored.

CC @Zylann @reduz @BastiaanOlij

Comparison

At origin
Before: (Ignore the triangle, it moves to show smooth particle movement)

Screenshot from 2022-09-20 17-52-35

After:

Screenshot from 2022-09-20 17-21-13

1 billion units away!
Before:

Screenshot from 2022-09-20 17-52-30

After:

Screenshot from 2022-09-20 17-21-09

@BastiaanOlij (Contributor)

Took me a minute to wrap my brain around it but this is a deceptively simple solution to a gnarly problem. Really cool!

We calculate the lost precision on the CPU and pass it into the GPU
so that it can calculate an error-corrected version of the vertex position
@TechnoPorg (Contributor)

I hope this isn't a ridiculous question, but should/can this be implemented for 2D as well? Based on my limited testing, it seems like 2D rendering also starts to jitter when getting millions of units out.

@clayjohn (Member, Author)

I hope this isn't a ridiculous question, but should/can this be implemented for 2D as well? Based on my limited testing, it seems like 2D rendering also starts to jitter when getting millions of units out.

I never even considered 2D. I guess it can be added if there is demand, but I wouldn't add it without reason. This code gets inserted in all qualifying shaders in the doubles build of the engine, so it has the potential to reduce performance even for normal UI stuff that would never need it.

@TechnoPorg (Contributor)

Sounds good, that makes sense!

@fire (Member) left a comment

I am not part of the rendering team which owns this, but I did successfully test with the sample project given by @Zylann, and I did the first PR on double precision.

@BastiaanOlij (Contributor) left a comment

We discussed this in chat, and the feedback so far is that this is working as advertised.

I had some concerns initially, but after discussing with Clayjohn I think this is a very elegant solution.

Code-wise I have nothing to add; maybe @reduz will want to have a final say, but I think this is good to go.

@akien-mga merged commit 67961d8 into godotengine:master Sep 30, 2022
@akien-mga (Member)

Thanks!

@albinaask (Contributor)

For reference and for clarity, is this only applicable when building godot with "bits = 64"? Or am I misunderstanding something?

@Calinou (Member) commented Sep 30, 2022

For reference and for clarity, is this only applicable when building godot with "bits = 64"? Or am I misunderstanding something?

No, float=64 can be used with 32-bit builds too. However, this is ill-advised as float=64 adds some computation overhead and 32-bit-only CPUs are often very slow.

@albinaask (Contributor)

I agree; my question was rather whether the transformation matrix on the CPU side must be computed in 64 bits in order to apply an adequate correction in the shader. And is this only done when Godot is compiled with "bits=64"?

@clayjohn deleted the double-precision-rendering branch October 14, 2022 17:09
@realkotob (Contributor)

This is cool!

Will this be enabled by default for official stable builds?

@clayjohn (Member, Author)

@realkotob No, this is only applicable to "doubles" builds of the engine.

@expikr commented Jul 2, 2023

Hey, I tried to implement this in Love2D, do you guys think I've got it right?

https://github.com/groverburger/g3d/pull/45/files

I basically packaged the residual component into the unused bottom three zeroes of the 4x4 matrix like so:

packagedMatrix =
affine_xx , affine_xy , affine_xz , coarse_x ;
affine_yx , affine_yy , affine_yz , coarse_y ;
affine_zx , affine_zy , affine_zz , coarse_z ;
residue_x , residue_y , residue_z ,    1     ;

Then unpacked them in the shader this way:

    mat4 diff = modelPacked - viewPacked; // viewpacked's third column is actually in worldspace
    vec3 displacement = diff[3].xyz + vec3(diff[0].w,diff[1].w,diff[2].w);
    mat3 modelAffine = mat3(modelPacked);
    mat4 modelView = mat4( mat3(viewPacked) * mat4x3(modelAffine[0],modelAffine[1],modelAffine[2],displacement) );
    worldPosition = vec4(modelAffine*vertexPosition.xyz, 0 ) + modelPacked[3];
    viewPosition  = modelView*vertexPosition;
    screenPosition = projectionMatrix * viewPosition;

@clayjohn (Member, Author) commented Jul 6, 2023

Hey, I tried to implement this in Love2D, do you guys think I've got it right?

https://github.com/groverburger/g3d/pull/45/files

It looks like you are missing the double_add_vec3() function in the shader. double_add_vec3() is the key to ensuring that you retain as much precision as possible; you cannot use regular addition, as you lose precision in doing so.

@expikr commented Jul 7, 2023

Interesting; I did some reading about the 2Sum and Fast2Sum algorithms. I tweaked the four-vector addition to a less generalized version that discards the very last error term, as you have done in your shaders for the final position:

vec3 two_sum(vec3 a, vec3 b, out vec3 out_p) {
    vec3 s = a + b;
    vec3 v = s - a;
    out_p = (a - (s - v)) + (b - v);
    return s;
}
vec3 precise_sum(vec3 A, vec3 a, vec3 B, vec3 b) {
    vec3 D,d;
    vec3 C = two_sum(A,B,D);
    vec3 c = two_sum(a,b,d);
    vec3 CcD = C + (c+D);
    vec3 e = (c+D) - (CcD-C);
    return CcD + (d+e);
}

vec4 position(mat4 transformProjection, vec4 vertexPosition) {
    mat3 modelAffine = mat3(modelPacked);
    vec3 modelCoarse = modelPacked[3].xyz;
    vec3  viewCoarse =  viewPacked[3].xyz;
    vec3 modelFine = (transpose(modelPacked))[3].xyz;
    vec3  viewFine = (transpose( viewPacked))[3].xyz;
    vec3 displacement = precise_sum(modelCoarse,modelFine,-viewCoarse,-viewFine);
    mat4 modelView = mat4( mat3(viewPacked) * mat4x3(modelAffine[0],modelAffine[1],modelAffine[2],displacement) );
    viewPosition  = modelView*vertexPosition;
    screenPosition = projectionMatrix * viewPosition;
    return screenPosition;
}

Out of curiosity, what would be the advantages and disadvantages of just pre-computing the vec3 displacements on the CPU in doubles and then sending them to the shader as the relative difference?

@clayjohn (Member, Author) commented Jul 7, 2023

Out of curiosity, what would be the advantages and disadvantages of just pre-computing the vec3 displacements on the CPU in doubles and then sending them to the shader as the relative difference?

The reason to use the specialized addition operations is that you minimize the error loss from the operations. You do that by tracking the error from every operation, aggregating that error, and then adding it back in during subsequent operations. The net result is an overall reduction in error. If you simply take the original error term and add it back in at the very end, you will still have the cumulative error that came from the operations. In other words, that approach would be better than nothing, but it would not be as effective as the full solution used here.
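
To illustrate why the residual has to ride along through the operations, here is a minimal sketch (again plain C++ with float arithmetic standing in for the shader's floats; df_add is an illustrative name, not the engine's function). The pair addition renormalizes after every step, so each step's rounding error keeps compensating the next one; the loop uses an exaggerated chain of additions just to make the effect visible:

// Illustrative only; compile without -ffast-math.
#include <cstdio>

// Knuth's two-sum: a + b = s + err exactly.
float two_sum(float a, float b, float &err) {
    float s = a + b;
    float v = s - a;
    err = (a - (s - v)) + (b - v);
    return s;
}

// (hi, lo) + (hi, lo) -> (hi, lo): the error of this step is folded into the
// low part, where it keeps compensating every subsequent operation.
void df_add(float a_hi, float a_lo, float b_hi, float b_lo,
            float &out_hi, float &out_lo) {
    float err;
    float s = two_sum(a_hi, b_hi, err);
    out_hi = two_sum(s, a_lo + b_lo + err, out_lo);
}

int main() {
    // Add 0.1 a million times onto a large base. Plain float addition rounds
    // the 0.1 away at every step; the pair arithmetic keeps accumulating it.
    float naive = 1.0e8f;
    float hi = 1.0e8f, lo = 0.0f;
    for (int i = 0; i < 1000000; i++) {
        naive += 0.1f;
        df_add(hi, lo, 0.1f, 0.0f, hi, lo);
    }
    printf("naive: %f, compensated: %f, reference: %f\n",
           naive, hi + lo, 1.0e8 + 0.1 * 1000000.0);
    return 0;
}

The plain float sum never moves off the base value, while the compensated pair stays close to the double-precision reference; adding the starting error term back only at the end would still leave you with the plain float result.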

@expikr commented Jul 7, 2023

The reason to use the specialized addition operations is that you minimize the error loss from the operations. You do that by tracking the error from every operation, aggregating that error, and then adding it back in during subsequent operations. The net result is an overall reduction in error. If you simply take the original error term and add it back in at the very end, you will still have the cumulative error that came from the operations. In other words, that approach would be better than nothing, but it would not be as effective as the full solution used here.

Thanks, but that didn't quite answer my question. I'm asking why not just compute the modelpos - viewpos subtractions entirely on the CPU in doubles, which would unequivocally give better precision than any perfectly-compensated two-float operation, because doubles have a 53-bit mantissa whereas a perfectly computed two-float has a 48-bit one. I was wondering whether there are performance disadvantages for mesh instancing with the CPU-side approach, as you'd need to re-send the position info of every single instance every frame.

@clayjohn (Member, Author) commented Jul 9, 2023

Ah, thanks for clarifying. For certain renderers it may be fully possible to pre-multiply the transforms using doubles and only send a baked model-view matrix. This would be optimal as far as precision goes. Godot can't use that approach for two reasons:

  1. Instance transforms (AKA model transforms) are cached on the GPU and only updated when a model changes. This saves a ton of GPU bandwidth, as we don't need to pass a transform per mesh per frame from the CPU to the GPU. If you pre-multiply the transform on the CPU, you have to pass the entire transform per mesh per frame, which is very slow.
  2. Godot lets users write their own vertex shaders, so users need to be able to access the individual transforms before they are combined. Accordingly, pre-calculating the MVP means sending an additional transform which, again, is really bad for bandwidth usage.

Overall, it is preferable to do a few extra vertex calculations rather than to send more data from the CPU to the GPU for each mesh every frame.

@albinaask (Contributor)

Out of curiosity, does the vertex shader compute the VP matrix every time, or is it updated once per viewport from the CPU side? Would this be a performance concern, since I suppose GPUs have a one-cycle 4x4 matrix multiplication operation in the SFU?

@expikr commented Apr 28, 2024

Ah, thanks for clarifying. For certain renderers it may be fully possible to pre-multiply the transforms using doubles and only send a baked model-view matrix. This would be optimal as far as precision goes.

I would like to clarify that I did not mean calculating the MVP on the CPU, but rather purely the world-space position delta relative to the camera.

Here is future me answering past me:

Even without needing to recalculate the matrix for every single object on the CPU, there is still a deal-breaking fact: it means you will also be updating the positions of all static objects whenever your camera position changes.

Even dynamic objects or skeletal animations are not necessarily updated every single frame, whereas with positions tied to your camera, simply walking around results in constantly flushing out the instance data.

Successfully merging this pull request may close these issues:

Vulkan: Objects close to the camera jitter a lot when far from the origin (even with float=64)