
How to acquire synchronized frames from depth and main(1280x720) RGB cameras? #64

Closed
maurosyl opened this issue Oct 23, 2018 · 13 comments

@maurosyl

Depth and RGB frames on the HoloLens are not aligned. I'm trying to get information about the depth of the scene the HoloLens user is seeing, but to do so I must establish a correspondence between the coordinates of the two kinds of frames, probably with some stereo calibration algorithm or something similar. The problem is that every approach I can think of requires the RGB frames to be acquired "at the same time" as the depth frames I can easily get from the Recorder tool. Any idea how to do that?

@Huangying-Zhan

Hi @mauronano, poses of the sensors at different timestamps are available in research mode. For depth maps, there is also an unprojection model for the depth sensor. Using the unprojection model, you can get 3D coordinates of the points in the depth map. Knowing the relative pose between the depth sensor and the RGB camera, you should be able to transform the 3D points into the RGB camera coordinate system and project them into the RGB camera view, which means you can create depth maps for the RGB camera view. In this case, the RGB frame and the depth frame do not need to be acquired at the same time. However, if the time gap between them is too large, other issues appear, such as occlusion.
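A minimal NumPy sketch of that pipeline, assuming you already have the depth sensor's unprojection map from research mode, the camera-to-world pose of each sensor at its own timestamp, and the RGB intrinsics (all variable names below are placeholders, not the actual API):

```python
import numpy as np

def depth_to_rgb_pixels(depth, unproject_map, T_world_from_depth, T_world_from_rgb, K_rgb):
    """Map a depth frame into the RGB camera's image plane.

    depth:              (H, W) depth values in metres
    unproject_map:      (H, W, 2) per-pixel unit-plane directions from the
                        research-mode unprojection model (assumed layout)
    T_world_from_depth: (4, 4) depth-camera-to-world pose at the depth timestamp
    T_world_from_rgb:   (4, 4) RGB-camera-to-world pose at the RGB timestamp
    K_rgb:              (3, 3) RGB camera intrinsics
    """
    # 1. Unproject: 3D points in the depth camera frame.
    xy = unproject_map.reshape(-1, 2)
    z = depth.reshape(-1)
    pts_depth = np.column_stack([xy[:, 0] * z, xy[:, 1] * z, z, np.ones_like(z)])

    # 2. Relative pose depth -> RGB (from the two absolute poses), then transform.
    T_rgb_from_depth = np.linalg.inv(T_world_from_rgb) @ T_world_from_depth
    pts_rgb = (T_rgb_from_depth @ pts_depth.T)[:3]           # (3, N)

    # 3. Project into the RGB image, keeping only points in front of the camera.
    valid = pts_rgb[2] > 0
    uvw = K_rgb @ pts_rgb[:, valid]
    uv = uvw[:2] / uvw[2]
    return uv, pts_rgb[2, valid]    # pixel coordinates and depth in the RGB view
```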

@FracturedShader

@mauronano, what @Huangying-Zhan said is exactly right. In my case I have an additional set of steps where I keep the latest frame from each of the streams, but only send the data from the last frame of each if the wearer hasn't moved very much in the last second. This is mainly to prevent blurry images, but it also makes debugging a little easier. While the CameraStreamCoordinateMapper and CameraStreamCorrelation samples seem to indicate that you can directly convert from one image to the other, I found this simply not to be true, likely because those two samples use a stationary Kinect, which behaves differently. In my case I ended up doing: ColorProjectionMatrix * ColorViewMatrix * DepthSpaceToColorSpace * InverseDepthViewMatrix * DepthCameraPoint. It's a whole ordeal, but all in all a pretty standard 2D->3D->3D->2D graphical space conversion pipeline. To get the DepthSpaceToColorSpace matrix you can use the TryGetTransformTo method with the two SpatialCoordinateSystems, like they do in the MediaFrameReaderContext::FrameArrived function. Since your previous question shows that you are using Unity, be sure to convert the matrices before doing anything with them. The Mixed Reality Toolkit has a preview sample that shows how to do that.
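For reference, that chain in NumPy with column vectors (the identity matrices below are placeholders; the real values come from the stream metadata and from TryGetTransformTo, and depending on whether you treat the stored matrices as row- or column-major you may need to transpose them first):

```python
import numpy as np

# Placeholder 4x4 matrices; in practice these come from the frame metadata
# (CameraProjectionTransform / CameraViewTransform) and from
# SpatialCoordinateSystem.TryGetTransformTo between the two streams.
color_projection     = np.eye(4)
color_view           = np.eye(4)
depth_space_to_color = np.eye(4)
inv_depth_view       = np.eye(4)

# Homogeneous 3D point in depth camera space (from the depth unprojection).
depth_camera_point = np.array([0.1, -0.05, 0.8, 1.0])

clip = color_projection @ color_view @ depth_space_to_color @ inv_depth_view @ depth_camera_point
ndc = clip[:2] / clip[3]   # perspective divide -> roughly [-1, 1] image coordinates
```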

@FracturedShader

I would like to add that the projection returned by the streams for the color image is very wrong. I ended up needing to fake it. Once you get it dialed in, though, you can capture things as small as individual wires pretty well.

[Image: thin wire]
[Image: junction box]

Folds in fabric come out pretty well too. Keep in mind, the data is rough, and full of holes.

[Image: shopping bag]

@maurosyl
Author

@Huangying-Zhan, @FracturedShader, thank you both for your answers. Just to make sure I understand what you suggest: after I manage to get the 3D coordinates of the unprojected depth pixels (which you refer to as "DepthCameraPoint"), I use the CameraViewTransform matrix provided by the Recorder tool to map them to the depth camera space. Regarding DepthSpaceToColorSpace, I'm not sure about the meaning of the "FrameToOrigin" matrix produced at the place in the code you pointed at; grasping that would probably help me understand the math behind this transformation. As for the "ColorViewMatrix", which stores the camera extrinsics, should I edit the Recorder project to provide that too? At the moment it returns only the parameters of the low-resolution RGB cameras and not the 1280x720 one.

@FracturedShader

@mauronano, unfortunately I ended up writing my own custom application. I haven't actually tried to use the Recorder project directly. As a result I don't know much about how it specifically works. You may just have to poke around and make modifications as you see fit to get the data you need.

@Huangying-Zhan

@mauronano, suppose you have 3D points in depth camera space; to get 3D points in color camera space, you need the relative pose between the depth camera and the color camera. The recorder app includes an example of getting the absolute pose for each sensor, which you can check here. From the absolute poses you can then compute the relative pose.
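As a small sketch of that last step (placeholder names, column-vector convention): if T_world_from_depth and T_world_from_color are the absolute camera-to-world poses logged for the two frames, the relative pose is just their composition:

```python
import numpy as np

# Absolute camera-to-world poses of the two sensors at their respective
# timestamps (placeholders; in practice read from the recorder output).
T_world_from_depth = np.eye(4)
T_world_from_color = np.eye(4)

# Relative pose taking points from depth camera space into color camera space.
T_color_from_depth = np.linalg.inv(T_world_from_color) @ T_world_from_depth
```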

@maurosyl
Author

I want to thank you both very much for your answers; I'm only just getting into computer vision and the material I find online is still pretty obscure to me. I have one last question: can I compute the alignment only once, offline, and then use it to map every depth frame onto my RGB frames, or is it something I need to do continuously?

@Huangying-Zhan

@mauronano, the depth frames and RGB frames are generally taken at different timestamps, so if you have many RGB-D pairs, the relative poses (between RGB and D) for these pairs are not constant and you can't use a single transformation for all the alignments. Therefore, you need to do this for each RGB-D pair. If the RGB-D pairs were always taken at the same timestamp, a single transformation would be enough for all the alignments; unfortunately, that is not the case here.
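In code terms, that just means recomputing the relative pose inside the loop over pairs rather than once up front; a sketch (hypothetical names, same convention as above):

```python
import numpy as np

def relative_poses(pairs):
    """pairs: iterable of (T_world_from_depth, T_world_from_color), one per RGB-D pair."""
    rels = []
    for T_wd, T_wc in pairs:
        # The head moves between the two capture times, so this differs per pair.
        rels.append(np.linalg.inv(T_wc) @ T_wd)
    return rels
```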

@pranaabdhawan

@mauronano, were you able to convert between the two coordinate systems? I tried the mapping by going from 2D depth -> world -> 2D RGB, using the matrices provided by the CSV file in the recorder app. I am following this to project into the RGB space: https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera. I can see the point clouds for the two views, but they have some offset in each coordinate. Maybe the multiplication by 0.5 etc. as given in the shader code example (above link) should not be implemented exactly as written. Do you know the correct approach? Thanks!
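For what it's worth, the 0.5 scaling in that document is just the usual conversion from projection-space coordinates in [-1, 1] (after the divide by w) to pixel coordinates, including the vertical flip; a sketch:

```python
def projected_to_pixel(x_ndc, y_ndc, width=1280, height=720):
    """Convert projection-space coordinates in [-1, 1] to pixel coordinates,
    flipping y so that row 0 is the top of the image."""
    u = (x_ndc * 0.5 + 0.5) * width
    v = (1.0 - (y_ndc * 0.5 + 0.5)) * height
    return u, v
```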

@cyberj0g

For anyone who comes here from Google, see my implementation of such a mapping. It might not be the best code, but it is fast and accurate enough for my purpose.

@LisaVelten

Hi everyone,

I am working on the same problem. I cannot get my depth images and HD images aligned; the following picture visualizes the problem. I try to align a calibration pattern: first I filter the 3D depth points that belong to the calibration pattern, then I project these points into the HD image.

[Image: CalibrationPattern_ViewPointLeft]

To figure out the problem, I investigated the meaning of the transformation matrices in some detail. In the following I outline my understanding of the transformation matrices and then describe how I try to align my images. I would appreciate your help very much!

1. CameraCoordinateSystem (MFSampleExtension_Spatial_CameraCoordinateSystem)
In the HoloLensForCV example this coordinate system is used to obtain the "FrameToOrigin" transformation. FrameToOrigin is obtained by transforming the CameraCoordinateSystem to the OriginFrameOfReference (lines 140-142 in MediaFrameReaderContext.cpp).

I still do not know exactly what this transformation describes. What is meant by "frame"?

Through experimenting I found out that the translation vector changes when I move. In fact, the changes make sense: if I move forward, the z-component becomes smaller. This agrees with the coordinate system in the image below; the z-axis points in the opposite direction of the image plane.

[Image: coordinatesystems]

The same applies to moving left or right: moving right makes the x component increase. The y component stays roughly constant, which makes sense as I am not moving up or down.
What I am really uncertain about is the rotational part of the transformation matrix. The rotational part is almost an identity matrix. The rotation of my head seems to be contained in the CameraViewTransform, which I describe in the second point.

As far as I understand, the FrameToOrigin Matrix looks as follows:
[1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
x, y, z, 1]

For me the "FrameToOrigin" seems to describe the relation between a fixed point on the HoloLens to the Origin (The Origin is defined each time the app is started, this helps to map each frame to a frame of reference). In the Image above the Origin is probably the "App-specific Coordinate System".

2. CameraViewTransform (MFSampleExtension_Spatial_CameraViewTransform)
The CameraViewTransform is saved directly with each frame (in contrast to FrameToOrigin, no additional transformation is necessary).

The rotation of the head seems to be saved within the rotational part of this matrix. I tested this by rotating my head around the y-axis. If I turn about 180° to the right around my y-axis, the rotational part looks as follows:
[0, 0, 1,
0, 1, 0,
-1, 0, 0]
This corresponds to a rotation around the y-axis, which is what we expect.

The translational part seems to stay roughly constant. This would make sense if it described the translation between the fixed point on the HoloLens and the respective camera (HD or depth). However, I would expect the translational part to stay exactly equal, and it does not; it is only approximately equal.

If I do not turn my head (rotational part is an Identity Matrix) the CameraViewTransform looks as follows:

CameraViewTransform for HD Camera
[1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
0.00631712, -0.184793, 0.145006, 1]

CameraViewTransform for Depth Camera
[1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
0.00798517, -0.184793, 0.0537722, 1]

So the CameraViewTransform seems to capture the rotation of the user's head. What is captured by the translational part? If it is the distance between a fixed point on the HoloLens and the respective camera, why is it not always exactly equal?
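If it helps to sanity-check those numbers, here is a small sketch (my own interpretation, not an authoritative answer) that composes the two view transforms printed above; since both rotations are identity, the relative depth-to-HD transform reduces to the difference of the two translation vectors, i.e. a fixed offset of roughly 9 cm, mostly along z:

```python
import numpy as np

# CameraViewTransform values from the post (row-vector layout), transposed
# into the column-vector convention used below.
view_hd = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0.00631712, -0.184793, 0.145006, 1],
]).T
view_depth = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0.00798517, -0.184793, 0.0537722, 1],
]).T

# Transform taking points from depth view space into HD view space.
T_hd_from_depth = view_hd @ np.linalg.inv(view_depth)
print(T_hd_from_depth[:3, 3])   # ~[-0.0017, 0.0, 0.0912]
```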

3. CameraProjectionTransform (MFSampleExtension_Spatial_CameraProjectionTransform)
This transformation is described on the following github page:
https://github.com/MicrosoftDocs/mixed-reality/blob/5b32451f0fff3dc20048db49277752643118b347/mixed-reality-docs/locatable-camera.md

However, what is still unclear to me is the meaning of the terms A and B.

My aim is to map between the depth camera and the HD camera of the HoloLens. To do this I do the following:

  1. I record images with the Recorder Tool of the HoloLensForCV sample.
  2. I take a depth image and look for the corresponding HD image by checking the timestamps.
  3. I use the unprojection mapping to find the 3D points in the CameraView space of the depth camera.
  4. I transform the 3D points from the depth camera view to the HD camera view and project them onto the image plane. I use the following transformations (a NumPy sketch of this chain follows below):
    Pixel Coordinates = [3D depth point, 1] * inv(CameraViewTransform_Depth) * FrameToOrigin_Depth * inv(FrameToOrigin_HD) * CameraViewTransform_HD * CameraProjectionTransform_HD
These pixel coordinates are in the range -1 to 1 and need to be adjusted to the image size of 1280x720. This is done as follows:
 x_rgb = 1280 * (PixelCoordinates.x + 1) / 2;
 y_rgb = 720 * (1 - ((PixelCoordinates.y + 1) / 2));
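Here is a NumPy sketch of step 4 and the pixel conversion above, keeping the row-vector layout as printed by the recorder (the divide by the w component is my assumption of how the projected coordinates end up in [-1, 1]):

```python
import numpy as np

def depth_point_to_hd_pixel(p_depth_cam,
                            view_depth, frame_to_origin_depth,
                            view_hd, frame_to_origin_hd,
                            projection_hd,
                            width=1280, height=720):
    """p_depth_cam: 3D point in the depth CameraView space (from the unprojection).
    All matrices are 4x4 in the row-vector layout written by the recorder."""
    p = np.array([*p_depth_cam, 1.0])

    # Depth camera view -> depth frame -> origin -> HD frame -> HD camera view -> projection.
    clip = (p @ np.linalg.inv(view_depth) @ frame_to_origin_depth
              @ np.linalg.inv(frame_to_origin_hd) @ view_hd @ projection_hd)

    # Perspective divide to get coordinates in roughly [-1, 1].
    x_ndc, y_ndc = clip[0] / clip[3], clip[1] / clip[3]

    # Scale to the 1280x720 HD image, flipping y so that row 0 is at the top.
    u = width * (x_ndc + 1.0) / 2.0
    v = height * (1.0 - (y_ndc + 1.0) / 2.0)
    return u, v
```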

Result: when I transform my detections from the depth camera to the HD camera image, the objects (in this case the calibration pattern) are not 100% aligned. So I am trying to figure out where the misalignment comes from. Am I misunderstanding the transformation matrices, or has anyone experienced similar problems?

The problems might occur if the spatial mapping of the HoloLens is not working 100% correctly, which can happen if the HoloLens cannot find enough features to map the room. Thus, I tested my setup in different rooms, especially smaller rooms with more clutter in the background (so that the HoloLens can find more features). However, the problem still occurs. As outlined above, the rough appearance of the transformations seems to be correct, and I have no idea how to test the transformation matrices further to pin down the problem.

I would appreciate your help very much! Thanks a lot in advance!
Lisa

@cxnvcarol

Hello. I haven't tried it myself, but it looks like you're very close to the solution. Could the error come from which matrix you are using for the reprojection? (Each camera has its own CameraViewTransform and FrameToOriginTransform, even if they're very close to each other.)

Also, if it helps, I personally have found it useful to check the ARUco sample in the HoloLensForCV project in order to understand how to use these matrices. In the attachment I've extracted the important lines. In this case they already have the correspondence and they triangulate the 3D point back to world coordinates. Let us know if you find the solution; I'd appreciate it.
2DPairTo3DPipelineHololens_arucoSample.pdf

@LisaVelten

I thought it might be better to open a new case (#119), as this one is already closed. I found a solution for the correct alignment, which I noted down in the comments there. The exact content of the transformation matrices is still unclear to me, so case #119 is still open.
