
Emulate double precision for regular rendering operation when REAL_T_IS_DOUBLE #66178

Merged 1 commit into godotengine:master on Sep 30, 2022

Conversation

@clayjohn (Member) commented Sep 20, 2022

Fixes: #58516
Finishes @fire's 4-year, 10-month, 30-day journey of double-precision support: #12299

We calculate the lost precision on the CPU and pass it into the GPU so that it can calculate an error-corrected version of the vertex position. The general approach is detailed here: http://andrewthall.org/papers/df64_qf128.pdf
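
To make the idea concrete, here is a minimal standalone sketch. This is plain C++ rather than actual engine or shader code, and split_double, two_sum and compensated_diff are illustrative names; the float math here stands in for what the shader does. The CPU splits each double coordinate into a high float plus the residual that a single float cannot hold, and the GPU-side subtraction recombines the two pairs with an error-compensated sum, so the camera-relative position keeps its sub-unit detail even a billion units from the origin:

// Illustrative only, not Godot's implementation. Compile without -ffast-math,
// otherwise the compiler may optimize the compensation away.
#include <cstdio>

// CPU side: split a double coordinate into a float plus the residual
// ("lost precision") that does not fit into a single float.
void split_double(double v, float &hi, float &lo) {
    hi = (float)v;                // nearest representable float
    lo = (float)(v - (double)hi); // what the rounding threw away
}

// Knuth's two-sum: a + b = s + err, with the rounding error recovered exactly.
float two_sum(float a, float b, float &err) {
    float s = a + b;
    float v = s - a;
    err = (a - (s - v)) + (b - v);
    return s;
}

// "GPU side" (emulated here in float): error-compensated difference of two
// (hi, lo) pairs, folding the final error term into the result.
float compensated_diff(float a_hi, float a_lo, float b_hi, float b_lo) {
    float err;
    float s = two_sum(a_hi, -b_hi, err);
    return s + (a_lo - b_lo + err);
}

int main() {
    const double vertex_x = 1000000000.25; // one billion units from the origin
    const double camera_x = 1000000000.0;

    float v_hi, v_lo, c_hi, c_lo;
    split_double(vertex_x, v_hi, v_lo);
    split_double(camera_x, c_hi, c_lo);

    float naive = (float)vertex_x - (float)camera_x;        // the 0.25 is lost
    float fixed = compensated_diff(v_hi, v_lo, c_hi, c_lo); // the 0.25 survives
    printf("naive: %f, compensated: %f\n", naive, fixed);
    return 0;
}

The naive single-float difference prints 0.0, while the compensated version recovers the 0.25 that was split off on the CPU.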

This allows rendering very large worlds as if we were using double precision in the shader when we are not.

We don't use doubles in the shader for two reasons:

  1. Limited support for doubles. The Metal shading language doesn't support them at all, and on some devices using doubles in shaders causes a crash (that's the case on my Intel integrated GPU).
  2. Even when GPUs do support doubles, performance can be poor

This method works for "normal" rendering only, meaning it does not work with the render modes skip_vertex_transform or world_vertex_coords; in either case you end up doing the calculations entirely in single-precision floats and lose the benefit of this approach. By the same token, any shader operations done in world space will still be a problem, as the shaders are 100% floats. This means that world triplanar is still limited by the bounds of single-precision floats.

I have not implemented this yet in GLES3 as I want to flesh out the 3D renderer a bit more first. Particularly, I want to ensure that this won't conflict with our floating point precision needs.

Lastly, as it is written, this approach does not work with particles/multimeshes. The same general approach can be used, but when applied to particles it will result in many times more calculations. Right now, for smaller ranges (under 500 km) this isn't a big deal, but once you approach 1,000 km the error becomes noticeable, and above 10,000 km the error is significant. I can add the relevant code for particles/multimeshes if desired.
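
(For a rough sense of scale, assuming 1 unit = 1 meter: the spacing between adjacent single-precision floats is roughly 3 cm at 500 km from the origin, 6 cm at 1,000 km, and a full 1 m at 10,000 km, so anything that has to round to a plain float at those distances can be off by up to half that spacing at every step.)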

Edit: I was wrong above. I thought that we would need to add a high-precision path to the instance_transform * model matrix multiplication, which would require decomposing the multiplication into a mat3 x mat3 multiplication and a high-precision dot product (a ton of calculations). However, I realized that when using the normal render path, we can keep the instance_transform separate and add its origin offset at the same time we do the model/view multiplication. So I've added a code path that ensures the full model matrix is available if MODEL_MATRIX is read in the shader, if using world vertex coords, if skipping vertex transform, or if not using doubles. Finally, I made MODEL_MATRIX read-only in shaders. Previously you could write to it and the value would be totally ignored.

CC @Zylann @reduz @BastiaanOlij

Comparison

At origin
Before: (Ignore the triangle, it moves to show smooth particle movement)

Screenshot from 2022-09-20 17-52-35

After:

Screenshot from 2022-09-20 17-21-13

1 billion units away!
Before:

Screenshot from 2022-09-20 17-52-30

After:

Screenshot from 2022-09-20 17-21-09

@BastiaanOlij (Contributor)

Took me a minute to wrap my brain around it but this is a deceptively simple solution to a gnarly problem. Really cool!

We calculate the lost precision on the CPU and pass it into the GPU
so that it can calculate an error-corrected version of the vertex position
@TechnoPorg (Contributor)

I hope this isn't a ridiculous question, but should/can this be implemented for 2D as well? Based on my limited testing, it seems like 2D rendering also starts to jitter when getting millions of units out.

@clayjohn (Member, Author)

I hope this isn't a ridiculous question, but should/can this be implemented for 2D as well? Based on my limited testing, it seems like 2D rendering also starts to jitter when getting millions of units out.

I never even considered 2D. I guess it can be added if there is demand, but I wouldn't add it without reason. This code gets inserted in all qualifying shaders in the doubles build of the engine, so it has the potential to reduce performance even for normal UI stuff that would never need it.

@TechnoPorg (Contributor)

Sounds good, that makes sense!

@fire (Member) left a comment

I am not part of the rendering team which owns this, but I did successfully test with the sample project given by @Zylann, and I did the first PR on double precision.

@BastiaanOlij (Contributor) left a comment

We discussed this in chat, and the feedback so far is that this is working as advertised.

I had some concerns initially, but after discussing with Clayjohn I think this is a very elegant solution.

Code-wise I have nothing to add; maybe @reduz will want to have a final say, but I think this is good to go.

@akien-mga merged commit 67961d8 into godotengine:master Sep 30, 2022
@akien-mga (Member)

Thanks!

@albinaask (Contributor)

For reference and for clarity, is this only applicable when building godot with "bits = 64"? Or am I misunderstanding something?

@Calinou (Member) commented Sep 30, 2022

For reference and for clarity, is this only applicable when building godot with "bits = 64"? Or am I misunderstanding something?

No, float=64 can be used with 32-bit builds too. However, this is ill-advised as float=64 adds some computation overhead and 32-bit-only CPUs are often very slow.

@albinaask (Contributor)

I agree; my question was rather whether the transformation matrix on the CPU side must be computed in 64 bits in order to apply an adequate correction in the shader. And is this only done when Godot is compiled with "bits=64"?

@clayjohn deleted the double-precision-rendering branch October 14, 2022 17:09
@realkotob (Contributor)

This is cool!

Will this be enabled by default for official stable builds?

@clayjohn (Member, Author)

@realkotob No, this is only applicable to "doubles" builds of the engine.

@expikr commented Jul 2, 2023

Hey, I tried to implement this in Love2D, do you guys think I've got it right?

https://github.com/groverburger/g3d/pull/45/files

I basically packaged the residual component into the unused bottom three zeroes of the 4x4 matrix like so:

packagedMatrix =
affine_xx , affine_xy , affine_xz , coarse_x ;
affine_yx , affine_yy , affine_yz , coarse_y ;
affine_zx , affine_zy , affine_zz , coarse_z ;
residue_x , residue_y , residue_z ,    1     ;

Then unpacked them in the shader this way:

    mat4 diff = modelPacked - viewPacked; // viewpacked's third column is actually in worldspace
    vec3 displacement = diff[3].xyz + vec3(diff[0].w,diff[1].w,diff[2].w);
    mat3 modelAffine = mat3(modelPacked);
    mat4 modelView = mat4( mat3(viewPacked) * mat4x3(modelAffine[0],modelAffine[1],modelAffine[2],displacement) );
    worldPosition = vec4(modelAffine*vertexPosition.xyz, 0 ) + modelPacked[3];
    viewPosition  = modelView*vertexPosition;
    screenPosition = projectionMatrix * viewPosition;

@clayjohn (Member, Author) commented Jul 6, 2023

Hey, I tried to implement this in Love2D, do you guys think I've got it right?

https://github.com/groverburger/g3d/pull/45/files

It looks like you are missing the double_add_vec3() function in the shader. double_add_vec3() is the key to ensuring that you retain as much precision as possible; you cannot use regular addition, as you lose precision in doing so.

@expikr commented Jul 7, 2023

Interesting; I did some reading about the 2Sum and Fast2Sum algorithms. I tweaked the four-vector addition to a less generalized version that discards the very last error term, as you have done in your shaders for the final position:

vec3 two_sum(vec3 a, vec3 b, out vec3 out_p) {
    vec3 s = a + b;
    vec3 v = s - a;
    out_p = (a - (s - v)) + (b - v);
    return s;
}
vec3 precise_sum(vec3 A, vec3 a, vec3 B, vec3 b) {
    vec3 D,d;
    vec3 C = two_sum(A,B,D);
    vec3 c = two_sum(a,b,d);
    vec3 CcD = C + (c+D);
    vec3 e = (c+D) - (CcD-C);
    return CcD + (d+e);
}

vec4 position(mat4 transformProjection, vec4 vertexPosition) {
    mat3 modelAffine = mat3(modelPacked);
    vec3 modelCoarse = modelPacked[3].xyz;
    vec3  viewCoarse =  viewPacked[3].xyz;
    vec3 modelFine = (transpose(modelPacked))[3].xyz;
    vec3  viewFine = (transpose( viewPacked))[3].xyz;
    vec3 displacement = precise_sum(modelCoarse,modelFine,-viewCoarse,-viewFine);
    mat4 modelView = mat4( mat3(viewPacked) * mat4x3(modelAffine[0],modelAffine[1],modelAffine[2],displacement) );
    viewPosition  = modelView*vertexPosition;
    screenPosition = projectionMatrix * viewPosition;
    return screenPosition;
}

Out of curiosity, what would be the advantages and disadvantages of just pre-computing the vec3 displacements on the CPU in doubles and then sending them to the shader as the relative difference?

@clayjohn (Member, Author) commented Jul 7, 2023

Out of curiosity, what would be the advantages and disadvantages of just pre-computing the vec3 displacements on the CPU in doubles and then sending them to the shader as the relative difference?

The reason to use the specialized addition operations is that you minimize the error loss from the operations. You do that by tracking the error from every operation, aggregating that error, and then adding it back in during subsequent operations. The net result is an overall reduction in error. If you simply take the original error term and add it back in at the very end, you will still have the cumulative error that came from the operations. In other words, that approach would be better than nothing, but it would not be as effective as the full solution used here.
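
To illustrate why the residual has to ride along through the operations, here is a minimal sketch (again plain C++ with float arithmetic standing in for the shader's floats; df_add is an illustrative name, not the engine's function). The pair addition renormalizes after every step, so each step's rounding error keeps compensating the next one; the loop uses an exaggerated chain of additions just to make the effect visible:

// Illustrative only; compile without -ffast-math.
#include <cstdio>

// Knuth's two-sum: a + b = s + err exactly.
float two_sum(float a, float b, float &err) {
    float s = a + b;
    float v = s - a;
    err = (a - (s - v)) + (b - v);
    return s;
}

// (hi, lo) + (hi, lo) -> (hi, lo): the error of this step is folded into the
// low part, where it keeps compensating every subsequent operation.
void df_add(float a_hi, float a_lo, float b_hi, float b_lo,
            float &out_hi, float &out_lo) {
    float err;
    float s = two_sum(a_hi, b_hi, err);
    out_hi = two_sum(s, a_lo + b_lo + err, out_lo);
}

int main() {
    // Add 0.1 a million times onto a large base. Plain float addition rounds
    // the 0.1 away at every step; the pair arithmetic keeps accumulating it.
    float naive = 1.0e8f;
    float hi = 1.0e8f, lo = 0.0f;
    for (int i = 0; i < 1000000; i++) {
        naive += 0.1f;
        df_add(hi, lo, 0.1f, 0.0f, hi, lo);
    }
    printf("naive: %f, compensated: %f, reference: %f\n",
           naive, hi + lo, 1.0e8 + 0.1 * 1000000.0);
    return 0;
}

The plain float sum never moves off the base value, while the compensated pair stays close to the double-precision reference; adding the starting error term back only at the end would still leave you with the plain float result.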

@expikr commented Jul 7, 2023

The reason to use the specialized addition operations is that you minimize the error loss from the operations. You do that by tracking the error from every operation, aggregating that error, and then adding it back in during subsequent operations. The net result is an overall reduction in error. If you simply take the original error term and add it back in at the very end, you will still have the cumulative error that came from the operations. In other words, that approach would be better than nothing, but it would not be as effective as the full solution used here.

Thanks, but that didn't quite answer my question. I'm asking why not just compute the modelpos - viewpos subtractions entirely on the CPU in doubles, which would unequivocally give better precision than any perfectly-compensated two-float operation, because doubles have a 53-bit mantissa whereas a perfectly computed two-float has a 48-bit one. I was wondering whether there are performance disadvantages for mesh instancing with the CPU-side approach, as you'd need to re-send the position info of every single instance every frame.

@clayjohn (Member, Author) commented Jul 9, 2023

Ah, thanks for clarifying. For certain renderers it may be fully possible to pre-multiply the transforms using doubles and only send a baked model-view matrix. This would be optimal as far as precision goes. Godot can't use that approach for two reasons:

  1. Instance transforms (AKA model transforms) are cached on the GPU and only updated when a model changes. This saves a ton of GPU bandwidth, as we don't need to pass a transform per mesh per frame from the CPU to the GPU. If you pre-multiply the transform on the CPU, you have to pass the entire transform per mesh per frame, which is very slow.
  2. Godot lets users write their own vertex shaders, so users need to be able to access the individual transforms before they are combined. Accordingly, pre-calculating the MVP means sending an additional transform which, again, is really bad for bandwidth usage.

Overall, it is preferable to do a few extra vertex calculations rather than to send more data from the CPU to the GPU for each mesh every frame.

@albinaask (Contributor)

Out of curiosity, does the vertex shader compute the VP matrix every time, or is it updated once per viewport from the CPU side? Would this be a performance concern, since I suppose GPUs have a one-cycle 4x4 matrix multiplication operation in the SFU?

@expikr commented Apr 28, 2024

Ah, thanks for clarifying. For certain renderers it may be fully possible to pre-multiply the transforms using doubles and only send a baked model-view matrix. This would be optimal as far as precision goes.

I would like to clarify that I did not mean calculating the MVP on the CPU, but rather purely the world-space position delta relative to the camera.

Here is future me answering past me:

Even without needing to recalculate the matrix for every single object on the CPU, there is still a deal-breaking fact: it means you will also be updating the positions of all static objects whenever your camera position changes.

Even dynamic objects or skeletal animations are not necessarily updated every single frame, whereas with positions tied to your camera, simply walking around results in constantly flushing out the instance data.

Successfully merging this pull request may close these issues:

Vulkan: Objects close to the camera jitter a lot when far from the origin (even with float=64)