Some questions about rend_util.py #12

Closed
DavidXu-JJ opened this issue Dec 3, 2022 · 3 comments

@DavidXu-JJ

Hi, thank you for your nice work. I have been trying to follow your work recently, and I have run into some problems that I hope to get answered in this issue.

  1. First question:
    In the function load_K_Rt_from_P, at line 48 of rend_util.py:
    pose = np.eye(4, dtype=np.float32)
    pose[:3, :3] = R.transpose()
    pose[:3,3] = (t[:3] / t[3])[:,0]

    This code really confuses me, and I'm not able to come up with an explanation for it.
    I read the following code at line 78 in rend_util.py:
    pixel_points_cam = lift(x_cam, y_cam, z_cam, intrinsics=intrinsics)
    # permute for batch matrix product
    pixel_points_cam = pixel_points_cam.permute(0, 2, 1)
    world_coords = torch.bmm(p, pixel_points_cam).permute(0, 2, 1)[:, :, :3]

    It seems that you use pose as a cameraToWorld matrix.
    I ran an experiment beforehand; the following code is from Stack Overflow:
import numpy as np
import cv2

k = np.array([[631,   0, 384],
              [  0, 631, 288],
              [  0,   0,   1]])
r = np.array([[-0.30164902,  0.68282439, -0.66540117],
              [-0.63417301,  0.37743435,  0.67480953],
              [ 0.71192167,  0.6255351 ,  0.3191761 ]])
t = np.array([ 3.75082481, -1.18089565,  1.06138781])

C = np.eye(4)
C[:3, :3] = k @ r
C[:3, 3] = k @ r @ t

out = cv2.decomposeProjectionMatrix(C[:3, :])
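
For completeness, this is how I unpack what the decomposition returns (my own snippet, not from rend_util.py, assuming OpenCV's documented output order of intrinsics, rotation, and a homogeneous translation vector):

# Unpack the outputs of cv2.decomposeProjectionMatrix (my own check).
# out[2] is the translation that the pose code quoted above dehomogenises.
K_dec, R_dec, t_dec = out[0], out[1], out[2]
print(np.allclose(K_dec / K_dec[2, 2], k))   # True: intrinsics recovered
print(np.allclose(R_dec, r))                 # True: rotation recovered
print((t_dec[:3] / t_dec[3]).ravel())        # [-3.7508,  1.1809, -1.0614]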

If I convert r and t into homogeneous coordinates and then compute R@T, which is the worldToCamera matrix, I get:

>>> T=np.eye(4)
>>> T[:3,3]=t
>>> R=np.eye(4)
>>> R[:3,:3]=r
>>> R@T
array([[-0.30164902,  0.68282439, -0.66540117, -2.64402567],
       [-0.63417301,  0.37743435,  0.67480953, -2.10814783],
       [ 0.71192167,  0.6255351 ,  0.3191761 ,  2.27037141],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

Then if I take the inverse of R@T, which I think is the cameraToWorld matrix, I get:

>>> np.linalg.inv((R@T))
array([[-0.30164902, -0.63417301,  0.71192166, -3.75082481],
       [ 0.6828244 ,  0.37743435,  0.6255351 ,  1.18089565],
       [-0.66540117,  0.67480953,  0.3191761 , -1.06138781],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

This result suggests that, to get the cameraToWorld matrix, we should concatenate R^(-1) and -T, instead of R^(-1) and T as is done at line 31 in rend_util.py:

pose = np.eye(4, dtype=np.float32)
pose[:3, :3] = R.transpose()
pose[:3,3] = (t[:3] / t[3])[:,0]

I don't know why it takes R^(-1) and T here.

  2. Second question:
    In the function lift, at line 96 in rend_util.py:
    def lift(x, y, z, intrinsics):
        # parse intrinsics
        intrinsics = intrinsics.cuda()
        fx = intrinsics[:, 0, 0]
        fy = intrinsics[:, 1, 1]
        cx = intrinsics[:, 0, 2]
        cy = intrinsics[:, 1, 2]
        sk = intrinsics[:, 0, 1]
        x_lift = (x - cx.unsqueeze(-1) + cy.unsqueeze(-1)*sk.unsqueeze(-1)/fy.unsqueeze(-1) - sk.unsqueeze(-1)*y/fy.unsqueeze(-1)) / fx.unsqueeze(-1) * z
        y_lift = (y - cy.unsqueeze(-1)) / fy.unsqueeze(-1) * z
        # homogeneous
        return torch.stack((x_lift, y_lift, z, torch.ones_like(z).cuda()), dim=-1)

    I don't know why x_lift takes y and fy into consideration.
    It seems that sk should be 0, but I checked it at runtime and got:
intrinsics
tensor([[[ 2.8923e+03, -2.1742e-04,  8.2320e+02,  0.0000e+00],
         [ 0.0000e+00,  2.8832e+03,  6.1907e+02,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]]],
       device='cuda:0')

It seems that sk is not 0. So the transformation becomes:

$$ \begin{bmatrix} x'\\y'\\z \end{bmatrix}= \begin{bmatrix} f_x&sk&c_x&0\\ 0&f_y&c_y&0\\ 0&0&1&0 \end{bmatrix} \begin{bmatrix} x\_lift\\y\_lift\\z\\1 \end{bmatrix} $$

Here [x_lift, y_lift, z, 1] is the point in camera coordinates.
I find that:

$$ x'=f_x \cdot x\_lift + sk \cdot y\_lift + c_x \cdot z $$

Solving for x_lift, the correct expression is:

$$ x\_lift = \cfrac{x'-c_x \cdot z - sk \cdot y\_lift}{f_x} $$

But in rend_util.py, x_lift is effectively computed as:

$$ x\_lift = \cfrac{(x'-c_x)\cdot z - sk \cdot y\_lift}{f_x} $$

So the code is correct only when z=1. Would it be better if it were simply changed to:

x_lift = (x / z - cx.unsqueeze(-1) + cy.unsqueeze(-1)*sk.unsqueeze(-1)/fy.unsqueeze(-1) - sk.unsqueeze(-1)*y/fy.unsqueeze(-1)) / fx.unsqueeze(-1) * z

(a division by z is added to x)
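
As a toy check of where the two expressions diverge (my own sketch; the pixel value is made up and the intrinsics are roughly the ones printed above):

fx, fy, cx, cy, sk = 2892.3, 2883.2, 823.2, 619.1, -2.1742e-04
x, y = 500.0, 300.0

def x_lift_current(z):    # expression as it currently stands in rend_util.py
    return (x - cx + cy * sk / fy - sk * y / fy) / fx * z

def x_lift_suggested(z):  # with the extra division of x by z suggested above
    return (x / z - cx + cy * sk / fy - sk * y / fy) / fx * z

print(x_lift_current(1.0) - x_lift_suggested(1.0))  # 0.0: identical at z = 1
print(x_lift_current(2.0) - x_lift_suggested(2.0))  # non-zero once z != 1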

The first question matters more to me than the second. Would you please explain the logic of the pose matrix to me?

Hope this issue would help other people as well.

I have tried my best to express my questions as clearly as possible. If anything is unclear or wrong on my side, please let me know.

@DavidXu-JJ (Author)

I have figured out the answer to the confusing Problem 1:

world_coords = torch.bmm(p, pixel_points_cam).permute(0, 2, 1)[:, :, :3]
ray_dirs = world_coords - cam_loc[:, None, :]

Here at line 79, the camera location is set to the T vector:
cam_loc = pose[:, :3, 3]

However, the actual camera location is at the -T vector. What matters in this function is the relative position between the pixel location and the camera location, so the cameraToWorld matrix doesn't need to take -T as its translation part.
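
A toy check of this argument (my own sketch, not code from the repository): shifting the pose translation moves world_coords and cam_loc by the same offset, so the ray directions don't change.

import torch

pose = torch.eye(4).unsqueeze(0)                     # (1, 4, 4) camera-to-world
pixel_points_cam = torch.rand(1, 4, 5)               # (1, 4, n) homogeneous camera points
pixel_points_cam[:, 3, :] = 1.0

def rays(p):
    world_coords = torch.bmm(p, pixel_points_cam).permute(0, 2, 1)[:, :, :3]
    cam_loc = p[:, :3, 3]
    return world_coords - cam_loc[:, None, :]

shifted = pose.clone()
shifted[:, :3, 3] += torch.tensor([1.0, -2.0, 3.0])  # move the camera location
print(torch.allclose(rays(pose), rays(shifted)))     # True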
I maintain my opinion on Problem 2, but since it's not a crucial part, I'm closing this issue.
Finally, I'm sorry for the annoying opening and closing of this issue (I'm not very familiar with how issues work).

@raynehe

raynehe commented May 16, 2023

@DavidXu-JJ Hi! Sorry to bother you. I encountered a similar problem related to the DTU dataset's coordinate-system convention, and I'm wondering if you know about it.

My dataset follows NeRF's coordinate-system convention, that is, the OpenGL convention (x-axis to the right, y-axis upward, and z-axis backward along the camera's focal axis).

My issue is that if I apply the dataset to VolSDF directly, the computed ray_dir is incorrect. I think the problem is in the rotation matrix; DTU/BlendedMVS might follow a different convention. But I couldn't find anything about the coordinate-system convention of the DTU dataset. Do you know about this?

Thank you very much!

@DavidXu-JJ (Author)

@raynehe
[image: coords]
If I'm not mistaken, I remember that most datasets follow the OpenCV coordinate convention. Maybe you can try simply reversing the y and z axes.
I'm sorry if my suggestion doesn't help or is wrong.
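
Something like this sketch is what I mean (my own rough code, assuming your pose is a camera-to-world matrix in the OpenGL/NeRF convention; please double-check it on your data):

import numpy as np

def opengl_to_opencv_c2w(c2w_gl):
    # Flip the camera's y and z axes: OpenGL/NeRF (x right, y up, z backward)
    # to an OpenCV-style convention (x right, y down, z forward).
    flip_yz = np.diag([1.0, -1.0, -1.0, 1.0])
    return c2w_gl @ flip_yz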
