
Details about the code in model.py #14

Open
taylover-pei opened this issue Aug 18, 2021 · 4 comments

@taylover-pei

Thanks a lot for sharing the code. You have done great work!

I have some questions about your code: In the model.py file, can you provide more details about the get_geometry function and the voxel_pooling function? I'm so confused about how they actually work.

Thanks a lot!

@manueldiaz96

manueldiaz96 commented Aug 22, 2022

The get_geometry function uses the intrinsic and extrinsic matrices of each camera, together with the set of predefined depths, to work out where each pixel's projection ray lands in 3D space.
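
For intuition, here is a minimal sketch of that unprojection for a single pixel using the standard pinhole model (this is not the repo's exact code; intrin, rot and tran stand for the per-camera intrinsics and camera-to-ego extrinsics):

import torch

def unproject_pixel(u, v, depth, intrin, rot, tran):
    """Lift pixel (u, v) at a given depth to a 3D point in the ego frame.

    intrin: 3x3 intrinsic matrix K
    rot, tran: camera-to-ego rotation (3x3) and translation (3,)
    """
    # Scale the homogeneous pixel coordinates by depth, then undo the
    # intrinsics: P_cam = K^-1 @ (d * [u, v, 1]).
    pix = torch.tensor([u * depth, v * depth, depth])
    p_cam = torch.linalg.inv(intrin) @ pix
    # Apply the extrinsics to move from the camera frame to the ego frame.
    return rot @ p_cam + tran

get_geometry does essentially this, but vectorized over every pixel of the feature map and every depth plane at once, producing the frustum of 3D points that voxel_pooling later pools into the BEV grid.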
The voxel_pooling method first takes a cumulative sum of the features (akin to an integral image), then finds which consecutive features fall into the same BEV cell (identified by ranks in the function) and keeps only the last feature of each cell. Finally, it recovers the sum of the features projected onto each cell by subtracting from each kept cumulative value the one before it, i.e. by shifting the array by one position. Let me give you an example with some pseudocode:

Let's say we have some features and take their cumulative sum:

import torch

feats = torch.tensor([[1,1], [1,1], [2,2], [2,2], [0,0], [0,0], [1,1], [1,1], [2,2], [2,2]])
ft_cumsum = feats.cumsum(0)
>>> [[1,1], [2,2], [4,4], [6,6], [6,6], [6,6], [7,7], [8,8], [10,10], [12,12]]

Now the rank array (which tells us which cell each feature corresponds to, in flattened indexing) is used to build the boolean mask kept by checking for repeated ranks. When a rank is repeated, only the right-most entry is kept, since that one holds the sum of all the features that fall into that cell:

ranks = torch.tensor([0, 0, 2, 3, 3, 4, 5, 5, 6, 7])
kept = torch.ones(feats.shape[0], dtype=torch.bool)
kept[:-1] = (ranks[1:] != ranks[:-1])
>>> [False, True, True, False, True, True, False, True, True, True]

So as you can see, since the features at positions 0 and 1, 3 and 4, and 6 and 7 fall into the same cells (cells 0, 3 and 5, respectively), we keep only the position in the array that holds the sum of all the features falling within each cell; this is why it is called cumulative-sum pooling.

Having the features sum-pooled, we take the difference between the kept tensor without its first entry and the same tensor without its last entry, prepending the first entry unchanged (since there is nothing before it). This recovers the true per-cell sums:

ft_cumsum = ft_cumsum[kept]
ft_cumsum = torch.cat((ft_cumsum[:1], ft_cumsum[1:] - ft_cumsum[:-1]))
>>> [[2,2], [2,2], [2,2], [0,0], [2,2], [2,2], [2,2]]

If you sum by hand the features that fall into each cell, you will find that they match this result.
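
In case you want to run the whole thing yourself, here is the example above as one self-contained PyTorch script (a minimal sketch of the trick, not the repo's exact code):

import torch

# Per-point features and the flattened cell index (rank) each point falls into.
feats = torch.tensor([[1, 1], [1, 1], [2, 2], [2, 2], [0, 0],
                      [0, 0], [1, 1], [1, 1], [2, 2], [2, 2]])
ranks = torch.tensor([0, 0, 2, 3, 3, 4, 5, 5, 6, 7])

# 1) Cumulative sum along the point dimension (points must be sorted by rank).
ft_cumsum = feats.cumsum(0)

# 2) Keep only the last point of each run of equal ranks.
kept = torch.ones(feats.shape[0], dtype=torch.bool)
kept[:-1] = ranks[1:] != ranks[:-1]

# 3) Differences of consecutive kept cumulative sums = per-cell feature sums.
ft_cumsum = ft_cumsum[kept]
pooled = torch.cat((ft_cumsum[:1], ft_cumsum[1:] - ft_cumsum[:-1]))
print(pooled)  # [[2,2], [2,2], [2,2], [0,0], [2,2], [2,2], [2,2]]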

I know this answer comes a bit late, but I hope it helps others!
Good luck!

@Deephome

@manueldiaz96 Well done!

@VeeranjaneyuluToka

VeeranjaneyuluToka commented Sep 23, 2022

@manueldiaz96, I am wondering if the get_geometry implementation is based on some formula you could point me to? Thanks!

@manueldiaz96

Take a look at this paper, where equation 2 describes how the projection from the camera to 3D is done.
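
For reference, the underlying formula is the standard pinhole back-projection: given pixel coordinates $(u, v)$, a depth $d$, the intrinsic matrix $K$, and the camera-to-ego rotation and translation $R$ and $\mathbf{t}$, the 3D point is

$$\mathbf{P} = R \, K^{-1} \, d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} + \mathbf{t}$$

get_geometry applies essentially this, vectorized, to every pixel and every depth plane at once.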

In their case, the depth comes from stereo depth estimation. For Lift-Splat-Shoot, as I answered you on issue #31, the network instead predicts, for each pixel, the certainty that the pixel is located at each depth plane. So rather than a single depth value per pixel, LSS uses a set of depth planes from 4 m to 45 m spaced 1 m apart, and the context vector is scaled by the classification score at each depth.
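
Concretely, the lift step is an outer product between the per-pixel depth distribution and the context vector. A minimal sketch with hypothetical tensor names and typical sizes (D depth bins, C context channels):

import torch

D, C, H, W = 41, 64, 8, 22        # depth bins, context channels, feature map size
x = torch.randn(1, D + C, H, W)   # hypothetical output of the depth/context head

depth = x[:, :D].softmax(dim=1)   # per-pixel distribution over the D depth planes
context = x[:, D:(D + C)]         # per-pixel context vector
# Outer product: each context vector is scaled by its score at every depth plane,
# giving one C-dimensional feature per (depth, pixel) location in the frustum.
frustum_feats = depth.unsqueeze(1) * context.unsqueeze(2)   # shape (1, C, D, H, W)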

If you want to understand this better, modify the code in this line to multiply by a ones tensor instead of x[:, self.D:(self.D + self.C)].unsqueeze(2), and see how the variable x looks after it is returned by get_voxels and before it is processed by the bevencode module. Use matplotlib or another visualization library to look at the arrays, and you will see what I am explaining.
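
Something like this, assuming the surrounding variable names from that part of the code (treat it as a sketch):

# Original lift: scale the context vector by the predicted depth scores.
new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

# Debug variant: multiply by ones instead, so only the depth distribution
# gets splatted into the BEV grid, which makes it easy to visualize.
ones = torch.ones_like(x[:, self.D:(self.D + self.C)].unsqueeze(2))
new_x = depth.unsqueeze(1) * ones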

If you want to understand this in more depth, I would recommend looking at how an image is formed in a camera, to see how the geometry that projects something in 3D down to a 2D image works. On issue #31 I also linked a series of blog posts which explain this using the camera matrices.
