Coordinate System Conventions #39

Closed · nikhilaravi opened this issue May 6, 2021 · 18 comments

nikhilaravi commented May 6, 2021

Can someone please clarify the conventions for the world-to-camera and camera projection transforms? In particular:

  • What is the world coordinate system for this dataset? In a previous issue it was mentioned that +X is down, +Y is to the right, and +Z is outwards from the screen. Is this correct? Or is it +Y up with a left-handed system (i.e. +X right, +Z into the screen), as mentioned in the paper?
  • Are the view_matrix and projection_matrix given assuming this world convention?
  • Is the projection_matrix given in terms of NDC or screen coordinates? I.e., do we need to convert fx, fy using the image width/height to get NDC values?

nikhilaravi commented May 7, 2021

Also, what's the difference between view_matrix and transform, and between intrinsics and projection_matrix?


import numpy as np

# `data` holds a single frame's camera metadata (e.g. a parsed AR frame or frame annotation).
transform = np.array(data.camera.transform).reshape(4, 4)
view = np.array(data.camera.view_matrix).reshape(4, 4)
intrinsics = np.array(data.camera.intrinsics).reshape(3, 3)
projection = np.array(data.camera.projection_matrix).reshape(4, 4)

It appears that the intrinsics have the focal length and principal point in screen space. transform and view_matrix appear to have the 2nd and 3rd rows swapped. Can someone please explain what the conventions are? I need to convert from these coordinate frames to the system in my codebase.


ahmadyan commented May 7, 2021

  1. The world coordinate +y is up (and aligned with gravity). You can see more information here:

[Figure: Objectron world coordinates]

The device coordinate system is the one you mentioned (xy is the device plane, and z is its normal).

  2. The view matrix is in world coordinates, describing the pose of the current frame relative to the first frame, from when ARSession tracking started (this happens before the first frame of the video is recorded). The projection matrix is the OpenGL projection matrix. To convert between the two coordinate systems you will need the projection_view matrix (i.e. projection @ view) and you also have to take the device orientation into account.

  3. The projection matrix folds in both the focal length and the image width/height, e.g. P[0][0] = focal_x / (w/2); in our videos w = 1920 and focal_x ~ 1500, so P[0][0] = 1.64911354e+00. Keep in mind that this projection matrix is computed for the original video resolution (typically 1920x1080). If you are using the down-sampled images in the tf.examples, you'll need to adjust this (see the rough numeric sketch below).

[Image: OpenGL projection matrix layout]

This is the reference I use for the projection matrix.
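As a rough numeric sketch of the P[0][0] relation and the down-sampling adjustment (the numbers below are placeholders, not read from a specific video):

fx = fy = 1500.0           # focal length in pixels (roughly the value above)
w, h = 1920.0, 1440.0      # landscape recording resolution

P00 = fx / (w / 2.0)       # first diagonal entry of the OpenGL-style projection
P11 = fy / (h / 2.0)

# For down-sampled images (e.g. resized to 640x480), rescale the pixel-space
# intrinsics by the resize factor; the NDC-form entries are unchanged because
# fx and w scale together.
scale = 640.0 / w
fx_small = fx * scale
cx_small = (w / 2.0) * scale
print(P00, fx_small / (640.0 / 2.0))   # identical values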

Finally, there are two approaches for converting world coordinates to image coordinates. Graphics folks (including Objectron) usually use the projection @ view matrix, whereas some multiview geometry/computer vision folks like to use intrinsics @ transform. Either one should work. If I recall, we should also have transform = inv(view).
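For instance, a quick way to check that relation on a frame, with data.camera as in the earlier snippet in this thread:

import numpy as np

transform = np.array(data.camera.transform).reshape(4, 4)
view = np.array(data.camera.view_matrix).reshape(4, 4)

# If transform really is the inverse of view, composing them gives the identity.
print(np.allclose(transform @ view, np.eye(4), atol=1e-4))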

In this tutorial, we show how to get pixels from 3D points using projection+view matrices.

Let me know if you have other questions. Happy to help.


nikhilaravi commented May 7, 2021

Thanks very much for the clarification!

I am using the projection * view approach to go from world to NDC.

How do you take the device orientation into account? For the example in the tutorial, the images are loaded with shape (H, W) = (1920, 1440). If the projection_matrix always assumes that w=1920, do you need to rotate the image so that it is correctly oriented as (H, W) = (1440, 1920) before applying the view and projection?

I'm trying to train a NeRF model using a video from Objectron, so I need to make sure the coordinate systems are aligned and the image size is set correctly.


ahmadyan commented May 7, 2021

Our images are 1080x1920, so w=1080, h=1920. Sorry for the typo above.

What we tend to do is swap the x and y in the output coordinates.
You have to ensure the image is oriented correctly. Our videos are recorded in portrait mode. If you use ffmpeg for extracting frames, ffmpeg is smart enough to understand that. However, if you use OpenCV, it will not honor the rotation metadata, so you have to rotate the images manually.
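For example, something like the sketch below (the exact rotation direction depends on the video's rotation metadata, so verify on a frame; the file path is a placeholder):

import cv2

cap = cv2.VideoCapture('video.MOV')   # placeholder path
ok, frame = cap.read()                 # decoded without applying the rotation metadata
if ok:
    # Rotate to portrait so the frame matches the annotation orientation.
    portrait = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)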

You can see how we use projection and view matrices to convert 3D points to image pixels here:

def project_points(points, projection_matrix, view_matrix, width, height):
    p_3d = np.concatenate((points, np.ones_like(points[:, :1])), axis=-1).T
    p_3d_cam = np.matmul(view_matrix, p_3d)
    p_2d_proj = np.matmul(projection_matrix, p_3d_cam)
    # Project the points
    p_2d_ndc = p_2d_proj[:-1, :] / p_2d_proj[-1, :]
    p_2d_ndc = p_2d_ndc.T

    # Convert the 2D Projected points from the normalized device coordinates to pixel values
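    # Note: the x/y swap below accounts for the projection matrix assuming
    # landscape orientation while the video frames are in portrait mode.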
    x = p_2d_ndc[:, 1]
    y = p_2d_ndc[:, 0]
    pixels = np.copy(p_2d_ndc)
    pixels[:, 0] = ((1 + x) * 0.5) * width
    pixels[:, 1] = ((1 + y) * 0.5) * height    
    pixels = pixels.astype(int)
    return pixels

Also, if you want to train NeRF, we found that the default camera poses (from the AR sessions) are not accurate enough, so we ran an offline bundle adjuster to optimize them. The results are stored in sfm_data.pbdata instead of geometry.pbdata. More info and some examples here and at https://fig-nerf.github.io/.


nikhilaravi commented May 7, 2021

Thanks! I'll try using the optimized camera poses instead.

So the projection and view transforms assume the image shape is w=1080, h=1920, but when inspecting the projection_matrix it seems that P[0][0] = focal_y / (H/2). Is this correct?
Similarly, in the intrinsics matrix the values seem to be:

fy 0  py
0  fx px
0  0  1 

Instead of swapping the xy in the output values, how should I modify the view and projection so that the xy values come out correctly in the output? I.e., should I swap the fx, fy and px, py positions?

Also, for the projection_matrix, if you're using the OpenGL conventions, do you also flip the direction of the z axis when going from camera -> NDC?

@nikhilaravi

Also, is there an example of how to parse the sfm_data.pbdata files?


ahmadyan commented May 8, 2021

  • sfm_data.pbdata is compatible with geometry.pbdata, so you can use the same code for parsing geometry.pbdata here.

  • I'm looking at one of the projection matrices (in this video, the recorded resolution was 1920x1440).
    We have f_x = f_y = 1588, so

P[0, 0] = focal / (w/2) = 1588 / (1920/2) ≈ 1.65 and P[1, 1] = focal / (h/2) = 1588 / (1440/2) ≈ 2.2

So the projection matrix assumes landscape mode.

projection_matrix = np.matrix(
  [[  1.64911354e+00,   0.00000000e+00,   1.38994455e-02,   0.00000000e+00],
  [  0.00000000e+00,   2.19881821e+00,  -2.26926804e-03,   0.00000000e+00],
  [  0.00000000e+00,   0.00000000e+00,  -9.99999762e-01,  -9.99999815e-04],
  [  0.00000000e+00,   0.00000000e+00,  -1.00000000e+00,   0.00000000e+00]])
  • That is actually an interesting question. I think swapping the first and second rows in the projection matrix, and the first and second columns in the view matrix, should suffice to swap the x, y of the homogenized point, but I'd have to write it down to be sure (a rough sketch follows the intrinsics below). The same goes for the intrinsics (note that you'll also need to swap the columns in the transform matrix as well). Also, f_x and f_y are the same in our dataset, but the principal points p_x and p_y should be close to the center of your image.
K = np.matrix(
  [[  1.58314905e+03, 0.00000000e+00, 9.46156555e+02],
  [  0.00000000e+00, 1.58314905e+03, 7.17866150e+02],
  [  0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])
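As a rough, unverified sketch of that idea: permuting the first two rows of the combined projection @ view matrix swaps the x and y of the homogenized output (algebraically this is the same as swapping only the projection-matrix rows):

import numpy as np

S = np.eye(4)
S[[0, 1]] = S[[1, 0]]                 # permutation of the first two rows

# view_matrix: the 4x4 view matrix of the same frame (not shown above).
pv_swapped = S @ projection_matrix @ view_matrix
# pv_swapped @ [X, Y, Z, 1] now returns (y, x, z, w) instead of (x, y, z, w).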


nikhilaravi commented May 8, 2021

@ahmadyan thanks for clarifying that the projection matrix assumes landscape mode. This was confusing, as the video orientation is portrait mode, so if I assume the image size in the NeRF model is (H, W) = (1920, 1440), the rays are not aligned correctly with the image. I will try swapping the rows and columns as you mentioned.

Regarding sfm_data.pbdata, is it at the same path as geometry.pbdata? I was not able to find the file, e.g. https://storage.googleapis.com/objectron/videos/shoe/batch-39/12/sfm_data.pbdata gives me a 404 error.

@ahmadyan

The correct filename is sfm_arframe.pbdata
So the path would be https://storage.googleapis.com/objectron/videos/shoe/batch-39/12/sfm_arframe.pbdata

There are 2 known issues which we are working to fix:

  1. Bundle adjuster failed on some sequences, so some sequences might miss this data.
  2. For some sequences, the number of frames in sfm_arframe.pbdata differs from geometry.pbdata and the number of frames in the video. We are working on a fix.

@nikhilaravi

@ahmadyan thanks a lot, I will try this out for a few videos!

@nikhilaravi

@ahmadyan how did you set the ray-sampling min/max depths for the NeRF model? Did you set them separately for each object video?

@ahmadyan

We used the jax-nerf implementation.
You can take a look at the default parameters here, which set near to 0.2 and far to 100 m. This setting works well for chairs, but might create some issues with other categories.

@yehDilBeparwah

@ahmadyan Thanks, this thread has been insightful, as I didn't know that an adjustment had to be made, as quoted below:

"If you are using the down-sampled images in the tf.examples, you'll need to adjust this."

1. I have tried making that adjustment for the projection matrix P as

P[0][0] = 2 * fx / red_w
P[1][1] = 2 * fy / red_h
P[0][2] = 2 * ((red_w - 1) / 2 - cx) / red_w
P[1][2] = 2 * (cy - (red_h - 1) / 2) / red_h

where red_w = 480, red_h = 640, and cx, cy = red_w/2, red_h/2.
I follow through with the homogeneous division and the subsequent NDC-to-pixel conversion, but still can't get the numbers to match the 'hello world' example.

2. More importantly, matrices like the view and projection matrices are device dependent. Wouldn't it be important to learn the first set of rotations and translations, i.e. the conversion from the unit box to world coordinates? Having learned those, the rest of the projection can be done via the projection system available on the device.

@ahmadyan

  1. Our principal points are calibrated, so they are not exactly w/2 and h/2 (although that is always a good approximation). You can get the exact numbers from the intrinsics matrix.
    1-b: If you just want to project points, simply use normalized device coordinates (uv coordinates). That gives you values in the range [-1, 1], which you can later re-scale to the size of the image (see the small sketch below). If you want to use the OpenGL pipeline, you'll need to modify the projection matrix.

  2. I'm afraid I don't fully understand this question. These parameters are calibrated per recording device (i.e. the phone) and remain static for the dataset. If you train a model using this dataset and deploy it on a different device, then you'll need to update your projection matrix. That is why we like to use normalized coordinates, which are device independent.
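For example, roughly (mirroring the project_points helper quoted earlier in this thread; the image sizes in the calls are only illustrative):

def ndc_to_pixels(x_ndc, y_ndc, width, height):
    # Map normalized device coordinates in [-1, 1] to pixel coordinates.
    return ((1.0 + x_ndc) * 0.5 * width, (1.0 + y_ndc) * 0.5 * height)

# The same NDC point lands at the corresponding pixel in a down-sampled frame:
print(ndc_to_pixels(0.25, -0.5, 1440, 1920))
print(ndc_to_pixels(0.25, -0.5, 360, 480))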

@WeiChengTseng

@ahmadyan Thanks for the information; this thread has been insightful for those who would like to apply Objectron to neural rendering.

I'm just wondering how to compute the distance between the tracked object and the camera.
My intuitive approach is shown below.

import numpy as np

# Assumes the compiled Objectron schema protos are importable, as in the Objectron tutorials.
from objectron.schema import annotation_data_pb2 as annotation_protocol

dist = lambda a, b: ((a - b)**2).sum()**0.5

def read_pbdata(path):
    # read the annotation file
    with open(path, 'rb') as pb:
        sequence = annotation_protocol.Sequence()
        sequence.ParseFromString(pb.read())
    return sequence

anno_path = './Objectron/laptop/test/annotations/annotation_21.pbdata'
frame_idx = 0
annotations = read_pbdata(anno_path)

# take the object position (world coordinate) for object translation
obj_pos = np.array(annotations.objects[0].translation)

# annotation of a frame
frame_anno = annotations.frame_annotations[frame_idx]
camera = frame_anno.camera

# take the camera position from the view matrix
camera_pos = np.array(camera.view_matrix).reshape(4, 4)[:3, 3]

# calculate the distance between the camera and object
print(dist(camera_pos, obj_pos))

However, the result is not reasonable when I observe the video and estimate the distance between the camera and the object. Are there other methods that can correctly calculate the distance between the camera and the object position?

@ahmadyan

It is a lot simpler than that. The object has 9 keypoints; the first keypoint is the center. Its .z (i.e. point_3d[2]) in the camera frame coordinates is the distance between the camera and the object.

Look at this tutorial and pay attention to this line:

  object_keypoints_3d.append((keypoint.point_3d.x, keypoint.point_3d.y, keypoint.point_3d.z))
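In code, roughly (reusing the annotations and frame_idx names from your snippet; the field names follow the Objectron annotation schema as used in that tutorial):

frame_anno = annotations.frame_annotations[frame_idx]

# The first keypoint of the first annotated object is the box center, expressed
# in the camera frame; its z component gives the camera-to-object distance.
center = frame_anno.annotations[0].keypoints[0].point_3d
print(abs(center.z))   # abs() in case the convention puts points in front of the camera at negative z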

@finnweiler

The object's annotation points are given in camera coordinates. How can those be converted into world coordinates where the world's origin sits right at the object's center point and is aligned with the coordinate axes?


ahmadyan commented Jul 8, 2021

In general, if you have the camera pose P and a point x, you can apply inv(P) @ x to get the point out of camera coordinates and into world coordinates. I don't fully understand the second part of your question, though.
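Roughly, reusing names from the earlier snippets in this thread, and assuming view_matrix follows the usual world-to-camera convention (so its inverse, i.e. the transform matrix, maps camera coordinates to world coordinates):

import numpy as np

# `camera` is a frame's camera metadata and `keypoint` is an annotated keypoint
# in camera coordinates, as in the snippets above.
view = np.array(camera.view_matrix).reshape(4, 4)
cam_to_world = np.linalg.inv(view)     # equivalently, camera.transform as noted earlier

p_cam = np.array([keypoint.point_3d.x, keypoint.point_3d.y, keypoint.point_3d.z, 1.0])
p_world = cam_to_world @ p_cam         # homogeneous point in world coordinates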
