Coordinate System Conventions #39

Closed · nikhilaravi opened this issue May 6, 2021 · 18 comments

nikhilaravi commented May 6, 2021

Can someone please clarify the conventions for the world-to-camera and camera projection transforms? In particular:

  • What is the world coordinate system for this dataset? In a previous issue it was mentioned that +X is down, +Y is to the right, and +Z is outwards from the screen. Is this correct? Or is it +Y up with a left-handed system (i.e. +X right, +Z into the screen), as mentioned in the paper?
  • Are the view_matrix and projection_matrix given assuming this world convention?
  • Is the projection_matrix given in terms of NDC or screen coordinates? I.e., do we need to convert fx, fy using the image width/height to get NDC values?

nikhilaravi commented May 7, 2021

Also, what's the difference between view_matrix and transform, and between intrinsics and projection_matrix?


import numpy as np

# `data` holds a single frame's camera metadata (e.g. a parsed AR frame or frame annotation).
transform = np.array(data.camera.transform).reshape(4, 4)
view = np.array(data.camera.view_matrix).reshape(4, 4)
intrinsics = np.array(data.camera.intrinsics).reshape(3, 3)
projection = np.array(data.camera.projection_matrix).reshape(4, 4)

It appears that the intrinsics have the focal length and principal point in screen space. transform and view_matrix appear to have the 2nd and 3rd rows swapped. Can someone please explain what the conventions are? I need to convert from these coordinate frames to the system in my codebase.


ahmadyan commented May 7, 2021

  1. The world coordinate +y is up (and aligned with gravity). You can see more information here:

[Figure: Objectron world coordinates]

The device coordinate system is the one you mentioned (xy is the device plane, and z is its normal).

  2. The view matrix is in world coordinates, describing the pose of the current frame relative to the first frame, from when ARSession tracking started (this happens before the first frame of the video is recorded). The projection matrix is the OpenGL projection matrix. To convert between the two coordinate systems you will need the projection_view matrix (i.e. projection @ view) and you also have to take the device orientation into account.

  3. The projection matrix folds in both the focal length and the image width/height, e.g. P[0][0] = focal_x / (w/2); in our videos w = 1920 and focal_x ~ 1500, so P[0][0] = 1.64911354e+00. Keep in mind that this projection matrix is computed for the original video resolution (typically 1920x1080). If you are using the down-sampled images in the tf.examples, you'll need to adjust this (see the rough numeric sketch below).

[Image: OpenGL projection matrix layout]

This is the reference I use for the projection matrix.
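As a rough numeric sketch of the P[0][0] relation and the down-sampling adjustment (the numbers below are placeholders, not read from a specific video):

fx = fy = 1500.0           # focal length in pixels (roughly the value above)
w, h = 1920.0, 1440.0      # landscape recording resolution

P00 = fx / (w / 2.0)       # first diagonal entry of the OpenGL-style projection
P11 = fy / (h / 2.0)

# For down-sampled images (e.g. resized to 640x480), rescale the pixel-space
# intrinsics by the resize factor; the NDC-form entries are unchanged because
# fx and w scale together.
scale = 640.0 / w
fx_small = fx * scale
cx_small = (w / 2.0) * scale
print(P00, fx_small / (640.0 / 2.0))   # identical values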

Finally, there are two approaches for converting world coordinates to image coordinates. Graphics folks (including Objectron) usually use the projection @ view matrix, whereas some multiview geometry/computer vision folks like to use intrinsics @ transform. Either one should work. If I recall, we should also have transform = inv(view).
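For instance, a quick way to check that relation on a frame, with data.camera as in the earlier snippet in this thread:

import numpy as np

transform = np.array(data.camera.transform).reshape(4, 4)
view = np.array(data.camera.view_matrix).reshape(4, 4)

# If transform really is the inverse of view, composing them gives the identity.
print(np.allclose(transform @ view, np.eye(4), atol=1e-4))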

In this tutorial, we show how to get pixels from 3D points using projection+view matrices.

Let me know if you have other questions. Happy to help.


nikhilaravi commented May 7, 2021

Thanks very much for the clarification!

I am using the projection * view approach to go from world to NDC.

How do you take the device orientation into account? For the example in the tutorial, the images are loaded with shape (H, W) = (1920, 1440). If the projection_matrix always assumes that w=1920, do you need to rotate the image so that it is correctly oriented as (H, W) = (1440, 1920) before applying the view and projection?

I'm trying to train a NeRF model using a video from Objectron, so I need to make sure the coordinate systems are aligned and the image size is set correctly.


ahmadyan commented May 7, 2021

Our images are 1080x1920, so w=1080, h=1920. Sorry for the typo above.

What we tend to do is swap the x and y in the output coordinates.
You have to ensure the image is oriented correctly. Our videos are recorded in portrait mode. If you use ffmpeg for extracting frames, ffmpeg is smart enough to understand that. However, if you use OpenCV, it will not honor the rotation metadata, so you have to rotate the images manually.
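For example, something like the sketch below (the exact rotation direction depends on the video's rotation metadata, so verify on a frame; the file path is a placeholder):

import cv2

cap = cv2.VideoCapture('video.MOV')   # placeholder path
ok, frame = cap.read()                 # decoded without applying the rotation metadata
if ok:
    # Rotate to portrait so the frame matches the annotation orientation.
    portrait = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)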

You can see how we use projection and view matrices to convert 3D points to image pixels here:

def project_points(points, projection_matrix, view_matrix, width, height):
    p_3d = np.concatenate((points, np.ones_like(points[:, :1])), axis=-1).T
    p_3d_cam = np.matmul(view_matrix, p_3d)
    p_2d_proj = np.matmul(projection_matrix, p_3d_cam)
    # Project the points
    p_2d_ndc = p_2d_proj[:-1, :] / p_2d_proj[-1, :]
    p_2d_ndc = p_2d_ndc.T

    # Convert the 2D Projected points from the normalized device coordinates to pixel values
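    # Note: the x/y swap below accounts for the projection matrix assuming
    # landscape orientation while the video frames are in portrait mode.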
    x = p_2d_ndc[:, 1]
    y = p_2d_ndc[:, 0]
    pixels = np.copy(p_2d_ndc)
    pixels[:, 0] = ((1 + x) * 0.5) * width
    pixels[:, 1] = ((1 + y) * 0.5) * height    
    pixels = pixels.astype(int)
    return pixels

Also, if you want to train NeRF, we found that the default camera poses (from the AR sessions) are not accurate enough, so we ran an offline bundle adjuster to optimize them. The results are stored in sfm_data.pbdata instead of geometry.pbdata. More info and some examples here and at https://fig-nerf.github.io/.


nikhilaravi commented May 7, 2021

Thanks! I'll try using the optimized camera poses instead.

So the projection and view transforms assume the image shape is w=1080, h=1920, but when inspecting the projection_matrix it seems that P[0][0] = focal_y / (H/2). Is this correct?
Similarly, in the intrinsics matrix the values seem to be:

fy 0  py
0  fx px
0  0  1 

Instead of swapping the xy in the output values, how should I modify the view and projection so that the xy values come out correctly in the output? I.e., should I swap the fx, fy and px, py positions?

Also, for the projection_matrix, if you're using the OpenGL conventions, do you also flip the direction of the z axis when going from camera -> NDC?

@nikhilaravi

Also, is there an example of how to parse the sfm_data.pbdata files?


ahmadyan commented May 8, 2021

  • sfm_data.pbdata is compatible with geometry.pbdata, so you can use the same code for parsing geometry.pbdata here.

  • I'm looking at one of the projection matrices (in this video, the recorded resolution was 1920x1440).
    We have f_x = f_y = 1588, so

P[0, 0] = focal / (w/2) = 1588 / (1920/2) ≈ 1.65 and P[1, 1] = focal / (h/2) = 1588 / (1440/2) ≈ 2.2

So the projection matrix assumes landscape mode.

projection_matrix = np.matrix(
  [[  1.64911354e+00,   0.00000000e+00,   1.38994455e-02,   0.00000000e+00],
  [  0.00000000e+00,   2.19881821e+00,  -2.26926804e-03,   0.00000000e+00],
  [  0.00000000e+00,   0.00000000e+00,  -9.99999762e-01,  -9.99999815e-04],
  [  0.00000000e+00,   0.00000000e+00,  -1.00000000e+00,   0.00000000e+00]])
  • That is actually an interesting question. I think swapping the first and second rows in the projection matrix, and the first and second columns in the view matrix, should suffice to swap the x, y of the homogenized point, but I'd have to write it down to be sure (a rough sketch follows the intrinsics below). The same goes for the intrinsics (note that you'll also need to swap the columns in the transform matrix as well). Also, f_x and f_y are the same in our dataset, but the principal points p_x and p_y should be close to the center of your image.
K = np.matrix(
  [[  1.58314905e+03, 0.00000000e+00, 9.46156555e+02],
  [  0.00000000e+00, 1.58314905e+03, 7.17866150e+02],
  [  0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])
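As a rough, unverified sketch of that idea: permuting the first two rows of the combined projection @ view matrix swaps the x and y of the homogenized output (algebraically this is the same as swapping only the projection-matrix rows):

import numpy as np

S = np.eye(4)
S[[0, 1]] = S[[1, 0]]                 # permutation of the first two rows

# view_matrix: the 4x4 view matrix of the same frame (not shown above).
pv_swapped = S @ projection_matrix @ view_matrix
# pv_swapped @ [X, Y, Z, 1] now returns (y, x, z, w) instead of (x, y, z, w).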


nikhilaravi commented May 8, 2021

@ahmadyan thanks for clarifying that the projection matrix assumes landscape mode. This was confusing, as the video orientation is portrait mode, so if I assume the image size in the NeRF model is (H, W) = (1920, 1440), the rays are not aligned correctly with the image. I will try swapping the rows and columns as you mentioned.

Regarding sfm_data.pbdata, is it at the same path as geometry.pbdata? I was not able to find the file, e.g. https://storage.googleapis.com/objectron/videos/shoe/batch-39/12/sfm_data.pbdata gives me a 404 error.

@ahmadyan

The correct filename is sfm_arframe.pbdata
So the path would be https://storage.googleapis.com/objectron/videos/shoe/batch-39/12/sfm_arframe.pbdata

There are 2 known issues which we are working to fix:

  1. Bundle adjuster failed on some sequences, so some sequences might miss this data.
  2. For some sequences, the number of frames in sfm_arframe.pbdata differs from geometry.pbdata and the number of frames in the video. We are working on a fix.

@nikhilaravi

@ahmadyan thanks a lot, I will try this out for a few videos!

@nikhilaravi

@ahmadyan how did you set the ray-sampling min/max depths for the NeRF model? Did you set them separately for each object video?

@ahmadyan

We used the jax-nerf implementation.
You can take a look at the default parameters here, which set near to 0.2 and far to 100 m. This setting works well for chairs, but might create some issues with other categories.

@yehDilBeparwah

@ahmadyan Thanks, this thread has been insightful, as I didn't know that an adjustment had to be made, as quoted below:

"If you are using the down-sampled images in the tf.examples, you'll need to adjust this."

1. I have tried making that adjustment for the projection matrix P as

P[0][0] = 2 * fx / red_w
P[1][1] = 2 * fy / red_h
P[0][2] = 2 * ((red_w - 1) / 2 - cx) / red_w
P[1][2] = 2 * (cy - (red_h - 1) / 2) / red_h

where red_w = 480, red_h = 640, and cx, cy = red_w/2, red_h/2.
I follow through with the homogeneous division and the subsequent NDC-to-pixel conversion, but still can't get the numbers to match the 'hello world' example.

2. More importantly, matrices like the view and projection matrices are device dependent. Wouldn't it be important to learn the first set of rotations and translations, i.e. the conversion from the unit box to world coordinates? Having learned those, the rest of the projection can be done via the projection system available on the device.

@ahmadyan

  1. Our principal points are calibrated, so they are not exactly w/2 and h/2 (although that is always a good approximation). You can get the exact numbers from the intrinsics matrix.
    1-b: If you just want to project points, simply use normalized device coordinates (uv coordinates). That gives you values in the range [-1, 1], which you can later re-scale to the size of the image (see the small sketch below). If you want to use the OpenGL pipeline, you'll need to modify the projection matrix.

  2. I'm afraid I don't fully understand this question. These parameters are calibrated per recording device (i.e. the phone) and remain static for the dataset. If you train a model using this dataset and deploy it on a different device, then you'll need to update your projection matrix. That is why we like to use normalized coordinates, which are device independent.
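For example, roughly (mirroring the project_points helper quoted earlier in this thread; the image sizes in the calls are only illustrative):

def ndc_to_pixels(x_ndc, y_ndc, width, height):
    # Map normalized device coordinates in [-1, 1] to pixel coordinates.
    return ((1.0 + x_ndc) * 0.5 * width, (1.0 + y_ndc) * 0.5 * height)

# The same NDC point lands at the corresponding pixel in a down-sampled frame:
print(ndc_to_pixels(0.25, -0.5, 1440, 1920))
print(ndc_to_pixels(0.25, -0.5, 360, 480))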

@WeiChengTseng

@ahmadyan Thanks for the information; this thread has been insightful for those who would like to apply Objectron to neural rendering.

I'm just wondering how to compute the distance between the tracked object and the camera.
My intuitive approach is shown below.

import numpy as np

# Assumes the compiled Objectron schema protos are importable, as in the Objectron tutorials.
from objectron.schema import annotation_data_pb2 as annotation_protocol

dist = lambda a, b: ((a - b)**2).sum()**0.5

def read_pbdata(path):
    # read the annotation file
    with open(path, 'rb') as pb:
        sequence = annotation_protocol.Sequence()
        sequence.ParseFromString(pb.read())
    return sequence

anno_path = './Objectron/laptop/test/annotations/annotation_21.pbdata'
frame_idx = 0
annotations = read_pbdata(anno_path)

# take the object position (world coordinate) for object translation
obj_pos = np.array(annotations.objects[0].translation)

# annotation of a frame
frame_anno = annotations.frame_annotations[frame_idx]
camera = frame_anno.camera

# take the camera position from the view matrix
camera_pos = np.array(camera.view_matrix).reshape(4, 4)[:3, 3]

# calculate the distance between the camera and object
print(dist(camera_pos, obj_pos))

However, the result is not reasonable when I observe the video and estimate the distance between the camera and the object. Are there other methods that can correctly calculate the distance between the camera and the object position?

@ahmadyan

It is a lot simpler than that. The object has 9 keypoints; the first keypoint is the center. Its .z (i.e. point_3d[2]) in the camera frame coordinates is the distance between the camera and the object.

Look at this tutorial and pay attention to this line:

  object_keypoints_3d.append((keypoint.point_3d.x, keypoint.point_3d.y, keypoint.point_3d.z))
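In code, roughly (reusing the annotations and frame_idx names from your snippet; the field names follow the Objectron annotation schema as used in that tutorial):

frame_anno = annotations.frame_annotations[frame_idx]

# The first keypoint of the first annotated object is the box center, expressed
# in the camera frame; its z component gives the camera-to-object distance.
center = frame_anno.annotations[0].keypoints[0].point_3d
print(abs(center.z))   # abs() in case the convention puts points in front of the camera at negative z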

@finnweiler

The object's annotation points are given in camera coordinates. How can those be converted into world coordinates where the world's origin sits right at the object's center point and is aligned with the coordinate axes?


ahmadyan commented Jul 8, 2021

In general, if you have the camera pose P and a point x, you can apply inv(P) @ x to get the point out of camera coordinates and into world coordinates. I don't fully understand the second part of your question, though.
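Roughly, reusing names from the earlier snippets in this thread, and assuming view_matrix follows the usual world-to-camera convention (so its inverse, i.e. the transform matrix, maps camera coordinates to world coordinates):

import numpy as np

# `camera` is a frame's camera metadata and `keypoint` is an annotated keypoint
# in camera coordinates, as in the snippets above.
view = np.array(camera.view_matrix).reshape(4, 4)
cam_to_world = np.linalg.inv(view)     # equivalently, camera.transform as noted earlier

p_cam = np.array([keypoint.point_3d.x, keypoint.point_3d.y, keypoint.point_3d.z, 1.0])
p_world = cam_to_world @ p_cam         # homogeneous point in world coordinates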
