
Checking if bev_pool is compiled properly #63

Closed
Divadi opened this issue Jul 8, 2022 · 12 comments


Divadi commented Jul 8, 2022

Hello, thank you for releasing the code.

I was trying to use bev_pool in other projects, but I found that my compilation of bev_pool doesn't seem to be yielding expected results. For a toy example:

device = "cuda:4"
bev_pool(
    torch.tensor([[5.0]], device=device),
    torch.tensor([[0, 0, 0, 0]], device=device),
    1, torch.tensor(1, device=device), torch.tensor(1, device=device), torch.tensor(1, device=device))

the output is

tensor([[[[[0.]]]]], device='cuda:4')

when I would expect it to be 5.0.

Please let me know if I have incorrectly used the function.
My environment is PyTorch 1.10.1, cudatoolkit 11.3.1, A6000 GPU.
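For reference, here is a pure-PyTorch sketch of what I assume bev_pool computes (the (x, y, z, batch) coordinate column order and the output layout are my guesses from the LSS convention, not the repo's documented API):

import torch

def bev_pool_reference(feats, coords, B, D, H, W):
    # Pure-PyTorch sketch: sum all features that land in the same
    # (batch, z, x, y) voxel. Assumes coords columns are (x, y, z, batch)
    # per the LSS convention, and that B/D/H/W are plain ints.
    C = feats.shape[1]
    out = feats.new_zeros(B, D, H, W, C)
    x, y, z, b = coords.long().unbind(1)
    out.index_put_((b, z, x, y), feats, accumulate=True)
    return out.permute(0, 4, 1, 2, 3).contiguous()  # (B, C, D, H, W)

With the toy inputs above (and D = H = W = 1 as plain ints), this returns tensor([[[[[5.]]]]]), which is what I expect from the compiled op as well.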

Thank you!


Divadi commented Jul 8, 2022

Actually, it seems my compilation is okay; evaluating the camera-only baseline yields:

mAP: 0.3151                                                                                                                                                                                               
mATE: 0.7155
mASE: 0.2742
mAOE: 0.5419
mAVE: 0.8821
mAAE: 0.2595
NDS: 0.3902
Eval time: 92.3s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.498   0.570   0.161   0.127   0.989   0.241
truck   0.265   0.737   0.210   0.142   0.838   0.233
bus     0.341   0.728   0.197   0.083   1.578   0.299
trailer 0.147   0.970   0.232   0.529   0.659   0.062
construction_vehicle    0.076   0.955   0.487   1.043   0.106   0.391
pedestrian      0.348   0.748   0.304   1.388   0.863   0.755
motorcycle      0.272   0.720   0.260   0.557   1.620   0.084
bicycle 0.215   0.597   0.271   0.868   0.403   0.010
traffic_cone    0.495   0.593   0.332   nan     nan     nan
barrier 0.495   0.537   0.287   0.139   nan     nan

which is lower than expected (mAP 33.25, NDS 40.15) but still non-trivial.

Is my usage incorrect by any chance?

kentang-mit self-assigned this Jul 8, 2022

@kentang-mit (Contributor)

That's quite interesting. I actually did not test bev_pool on small toy examples; I just integrated it into our pipeline and trained the entire network, so there might be some boundary cases where I made mistakes in the implementation.

Regarding the evaluation results, may I ask how many GPUs you are using? I also think the compilation should be correct, but such an accuracy drop looks unexpected to me.


Divadi commented Jul 8, 2022

Evaluation uses 4 GPUs.

Actually, bev_pool is behaving very strangely for me. When used as part of the pipeline, it yields reasonable results. So I tried adding

import pickle
pickle.dump([feats, coords, B, D, H, W, x], open(PICKLE_PATH, 'wb+'))
assert False

right after

x = x.permute(0, 4, 1, 2, 3).contiguous()

Then, I made another file loading the pickle results

import torch
from mmdet3d.ops import bev_pool
import pickle

def load_pickle(f):
    return pickle.load(open(f, 'rb'))

feats, coords, B, D, H, W, x = load_pickle(PICKLE_PATH)
k = bev_pool(feats, coords, B, D, H, W)

print((k != 0).sum(), (x != 0).sum())

And for some reason, the results are different!

tensor(0, device='cuda:2') tensor(4805600, device='cuda:2')

I've never had this issue with CUDA operations before, and I'm not quite sure how to go about debugging it, since the op clearly works as part of the entire pipeline but not on its own.
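To pin this down, a hypothetical sanity check (extending the loading script above and reusing the bev_pool_reference sketch from my first comment, so the same caveats about assumed semantics apply):

# Hypothetical check: compare the compiled op against the pure-PyTorch
# reference on the pickled inputs; a silently failing kernel shows up
# as an all-zero output.
ref = bev_pool_reference(feats, coords, B, int(D), int(H), int(W))
print(torch.allclose(k, ref), (ref != 0).sum())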


Divadi commented Jul 8, 2022

Another detail: when I paste the toy example

device = x.device
a = bev_pool(
    torch.tensor([[5.0]], device=device),
    torch.tensor([[0, 0, 0, 0]], device=device),
    1, torch.tensor(1, device=device), torch.tensor(1, device=device), torch.tensor(1, device=device))
print(a)
assert False

and run it as part of the pipeline by pasting it after this line

x = bev_pool(x, geom_feats, B, self.nx[2], self.nx[0], self.nx[1])

the correct result is printed.

Is it possible that there's something wrong with my installation?

@kentang-mit (Contributor)

I'm still working on that. I will get back to you once I've finished investigating this issue.


kentang-mit commented Jul 11, 2022

Hi @Divadi,

I looked into this issue recently. Would you mind trying out

CUDA_VISIBLE_DEVICES=4 python [your script].py

and modifying the device to cuda:0? Besides, I've pushed a new commit to the repo; would you mind also trying out the latest version?
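For context on why the first command might help: my guess (an assumption, not verified yet) is that the extension launches its kernel on the current CUDA device, which defaults to cuda:0, while your tensors live on cuda:4, so the kernel silently operates on the wrong device. If so, a Python-side workaround sketch would be:

import torch
from mmdet3d.ops import bev_pool

# Workaround sketch, assuming a missing device guard in the extension:
# make the input's device current so the kernel launches where the data is.
with torch.cuda.device(feats.device):
    out = bev_pool(feats, coords, B, D, H, W)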

Best,
Haotian

@kentang-mit (Contributor)

By the way, for multi-gpu evaluation, would you mind also exploring these two directions?

  • First, let's see whether things work out if you use all the available GPUs on your machine. I would assume that your machine has >4 GPUs because you have cuda:4.

  • Second, let's see whether the results are correct if you evaluate with only one GPU.


Divadi commented Jul 11, 2022

> Hi @Divadi,
>
> I looked into this issue recently. Would you mind trying out
>
> CUDA_VISIBLE_DEVICES=4 python [your script].py
>
> and modifying the device to cuda:0? Besides, I've pushed a new commit to the repo; would you mind also trying out the latest version?
>
> Best, Haotian

Before the change, with the toy example above:

$ CUDA_VISIBLE_DEVICES=4 python tools/tmp.py 
tensor([[[[[5.]]]]], device='cuda:0')
$ python tools/tmp.py 
tensor([[[[[0.]]]]], device='cuda:4')

After the change:

$ CUDA_VISIBLE_DEVICES=4 python tools/tmp.py 
tensor([[[[[5.]]]]], device='cuda:0')
$ python tools/tmp.py 
tensor([[[[[5.]]]]], device='cuda:4')

Seems like that was the issue; really odd, but good catch!

> By the way, for multi-gpu evaluation, would you mind also exploring these two directions?
>
>   • First, let's see whether things work out if you use all the available GPUs on your machine. I would assume that your machine has >4 GPUs because you have cuda:4.
>   • Second, let's see whether the results are correct if you evaluate with only one GPU.

I'll look into this soon; I need a bit of time.


Divadi commented Jul 13, 2022

@kentang-mit

> By the way, for multi-gpu evaluation, would you mind also exploring these two directions?
>
>   • First, let's see whether things work out if you use all the available GPUs on your machine. I would assume that your machine has >4 GPUs because you have cuda:4.
>   • Second, let's see whether the results are correct if you evaluate with only one GPU.

When evaluating with just one GPU or with all GPUs, the results are the same as before.

@kentang-mit (Contributor)

Thanks for the update. I'll investigate that.


Divadi commented Jul 16, 2022

@kentang-mit
Hi, I have addressed the issue. The problem was that my installation had Pillow 9.2.0, while the repository requires 8.4.0 to function properly. More details can be found in HuangJunJie2017/BEVDet#41.

I think Pillow 8.4.0 should be listed as an important requirement (sorry if I missed it).
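If it helps others, a minimal startup check (a sketch; the version sensitivity presumably comes from changes in Pillow's image-resizing behavior across versions) could be:

# Sketch: fail fast if the environment has the wrong Pillow version.
import PIL
assert PIL.__version__ == "8.4.0", f"found Pillow {PIL.__version__}, expected 8.4.0"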

New results:

mAP: 0.3325
mATE: 0.6828
mASE: 0.2717
mAOE: 0.5379
mAVE: 0.9040
mAAE: 0.2505
NDS: 0.4015
Eval time: 89.4s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.523   0.541   0.159   0.124   0.969   0.225
truck   0.280   0.704   0.208   0.131   0.911   0.233
bus     0.353   0.681   0.191   0.084   1.559   0.296
trailer 0.167   0.985   0.233   0.504   0.660   0.052
construction_vehicle    0.082   0.859   0.481   1.056   0.121   0.364
pedestrian      0.367   0.724   0.303   1.393   0.863   0.753
motorcycle      0.296   0.721   0.256   0.547   1.768   0.073
bicycle 0.237   0.577   0.270   0.862   0.382   0.007
traffic_cone    0.517   0.524   0.332   nan     nan     nan
barrier 0.503   0.513   0.284   0.140   nan     nan

Divadi closed this as completed Jul 16, 2022
@kentang-mit (Contributor)

Thank you for the very important hint. I'll add that to the README immediately!
