Is the background loss filtering out dynamic objects? #6
Hi,
Yes. As the name of the loss indicates, this loss is mostly for the static background objects, since static background objects would overlap between two frames after the transformation.
I am not quite sure what does the
Sorry for my previous inaccurate explanation. Actually, I trained the
Thanks for your fast reply; the last part clears things up regarding the hardware requirements.
If I am not mistaken, you generate a second "adjacent" input to the original one with a small time offset and evaluate your network on this adjacent input a second time. This seems quite computation-intensive (roughly 50% more training time per iteration for the same number of iterations, I guess) for "just" one more consistency loss. Why not do something similar to the foreground loss, where you use all 20 future frame predictions, mask out the static points, and train those to be close to each other for the same pixel?
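A rough sketch of what I mean, with hypothetical tensor names and shapes (so not your actual code), would be something like:

```python
import torch

def static_future_consistency(disp_pred, static_mask):
    """Hypothetical sketch: make the N future displacement predictions of the
    same static BEV cell agree with each other, without a second input.

    disp_pred:   (B, N, 2, H, W) predicted displacements for N future frames
    static_mask: (B, 1, H, W)    1 for cells believed to be static, else 0
    """
    mean_pred = disp_pred.mean(dim=1, keepdim=True)   # (B, 1, 2, H, W)
    diff = (disp_pred - mean_pred).abs().sum(dim=2)   # (B, N, H, W)
    diff = diff * static_mask                         # keep only static cells
    return diff.sum() / (static_mask.sum() * disp_pred.shape[1] + 1e-6)
```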
For the first question:
Yes. It will also include the static foreground objects (such as parked cars). That is why I said "this loss is mostly for the static background objects ..." For the second question:
Yes. This is a good question. It will indeed introduce more training cost, since the training data is enlarged. The benefits of introducing this second "adjacent" input are two-fold: (1) it can serve as a kind of "data augmentation"; (2) it enables the computation of the background consistency loss.
This is a very good question. First, note that we are actually training the network on a LiDAR video. So training on a single sequence and only focusing on the background cells may not be enough: across adjacent sequences, the network predictions will have some "flickering". That is, a static cell may be predicted as static in frame i, as moving in frame i + 1, and again as static in frame i + 2. This is inconsistent. That is why we introduce this loss, to help the network become aware of this consistency. Actually, this consistency is non-trivial to enforce in the context of LiDAR data, and I believe there are better ways to do it. For video consistency, I think this paper could be very helpful for you: https://arxiv.org/pdf/1808.00449.pdf
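To make the idea concrete, here is a simplified sketch (hypothetical names and shapes; not the exact implementation in this repository): run the network on the original input and on the adjacent input, rigidly align the adjacent prediction into the original BEV frame using the known ego-motion, and penalize disagreement on the background cells.

```python
import torch
import torch.nn.functional as F

def background_consistency_loss(pred_i, pred_adj, theta, bg_mask):
    """Simplified sketch: compare predictions from the original and the
    "adjacent" input on background cells, after rigidly aligning the adjacent
    prediction into the original BEV frame.

    pred_i, pred_adj: (B, C, H, W) network outputs for the two inputs
    theta:            (B, 2, 3)    rigid ego-motion (rotation + translation)
                                   in normalized BEV coordinates
    bg_mask:          (B, 1, H, W) 1 for background cells, else 0
    """
    grid = F.affine_grid(theta, pred_i.shape, align_corners=False)
    pred_adj_aligned = F.grid_sample(pred_adj, grid, align_corners=False)
    diff = (pred_i - pred_adj_aligned).abs() * bg_mask
    return diff.sum() / (bg_mask.sum() * pred_i.shape[1] + 1e-6)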
Forgot to thank you again for the explanations. I've modified the input field of view to be larger for my application (100x100m instead of 64x64m). Did you experiment with larger fields of view? My main follow-up question is still about the implementation of the background temporal consistency loss:

# --- Move pixel coord to global and rescale; then rotate; then move back to local pixel coord
translate_to_global = np.array(
    [[1.0, 0.0, -120.0], [0.0, 1.0, -120.0], [0.0, 0.0, 1.0]], dtype=np.float32
)
scale_global = np.array(
    [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32
)

I am not sure where these numbers come from. It seems to be part of the pixel-to-meter rescaling, but how exactly do these numbers need to be computed? 120 != 256/2, 120 != 256/64m, ... And sorry for coming back to this again:
My point was not that there are dynamic but currently parked foreground objects in this loss (those are "static" at that moment, so that is fine for the loss), but that there are also dynamic and actually moving objects in this loss. Is this also covered by your "mostly" formulation / is this intentional?
Hi @DavidS3141 The latest code should be as follows:
And for your question:
For this code, the number in the translation matrix is half the width and height of the BEV map.
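For example, assuming the BEV map is 256 x 256 pixels with a voxel size of 0.25 m per pixel (illustrative values only; the exact numbers in the latest code may differ), the two matrices could be constructed as:

```python
import numpy as np

voxel_size = 0.25          # assumed meters per BEV pixel
bev_size_px = 256          # assumed BEV map width/height in pixels

half = bev_size_px / 2.0   # the "half size" mentioned above

# Shift pixel coordinates so the ego vehicle (map center) sits at the origin
translate_to_global = np.array(
    [[1.0, 0.0, -half], [0.0, 1.0, -half], [0.0, 0.0, 1.0]], dtype=np.float32
)
# Rescale the centered pixel coordinates to meters
scale_global = np.array(
    [[voxel_size, 0.0, 0.0], [0.0, voxel_size, 0.0], [0.0, 0.0, 1.0]],
    dtype=np.float32,
)

# A rotation R about the ego center would then be applied in metric
# coordinates and mapped back to pixel coordinates, e.g.:
# full = np.linalg.inv(translate_to_global) @ np.linalg.inv(scale_global) \
#        @ R @ scale_global @ translate_to_global
```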
For some objects, if they are static at a given moment, they will be covered by this loss at that moment. But when they are moving again, they will not be covered by this loss.
Hi @DavidS3141 So if I understand correctly, you are wondering why we cover the foreground objects in this loss as well? This is a good question. Actually, as I mentioned before, the main purpose of this loss is to provide a sort of consistency for the background objects. But due to the difficulty of this problem, it currently brings some "by-products". That is, it will also cover some static or moving foreground objects (e.g., the truck in your picture). But this is not a big problem, because the background objects are dominant in the scene, and the "by-product" could also be helpful to some extent.
Hi again,
this is a question related to the paper; after skimming the code I was still not quite clear on it.
Your background temporal consistency loss in equation (3) of the paper seems reasonable for static points but not for dynamic ones because you specifically wrote that the alignment transformation T is rigid and therefore cannot account for object motion.
Are you filtering out cells that are dynamic/non-background for this loss?
Also, why do you need a complete second set of N motion maps for the background loss?
On a side note:
In a different issue #4 (comment) you wrote:
"with that GPU only having around 11GB."

However, even on my Tesla V100 16GB GPU, the training with train_multi_seq_MGDA.py ran out of memory at the very beginning. Running the complete training with 2 GPUs worked, though. Do you have an idea what the reason could be for this? Thanks again for your answer.