Question on the environment required to run sol-renderer #5

Open
heiwang1997 opened this issue May 21, 2021 · 8 comments

heiwang1997 commented May 21, 2021

Hi @tovacinni, thanks for this great work and the code release. I am trying to run your C++ renderer and I hit the following crash. Could you guide me on how to solve this issue, at your convenience?

The system is Ubuntu 20.04. I've tried both an RTX 3090 and a 1080, and neither works. By the way, the Python part works well: I can run the training and generate the rendered armadillo. The libtorch build was downloaded from https://download.pytorch.org/libtorch/cu111/libtorch-cxx11-abi-shared-with-deps-1.8.1%2Bcu111.zip

Here is the error message:

    (nglod) my@ws:~/nglod/sol-renderer/build$ ./sdfRenderer ../../sdf-net/_results/armadillo.npz
    NLOD Demo starting...
    GPU Device 0: "Ampere" with compute capability 8.6
    
    terminate called after throwing an instance of 'c10::Error'
      what():  CUDA error: an illegal memory access was encountered
    Exception raised from nonzero_cuda_out_impl at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:873 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f6705badb29 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7f6705baaab2 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libc10.so)
    frame #2: void at::native::nonzero_cuda_out_impl<bool>(at::Tensor const&, at::Tensor&) + 0xebe (0x7f66a6227c4e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #3: at::native::nonzero_out_cuda(at::Tensor&, at::Tensor const&) + 0x1eb (0x7f66a6199c5b in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #4: at::native::nonzero_cuda(at::Tensor const&) + 0xea (0x7f66a619a09a in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #5: <unknown function> + 0x2e6a80b (0x7f66a6fd180b in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #6: <unknown function> + 0x2e6a890 (0x7f66a6fd1890 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cuda_cu.so)
    frame #7: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)> const&, at::Tensor const&) const + 0xe7 (0x7f6692f17c57 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #8: at::nonzero(at::Tensor const&) + 0x5e (0x7f6692d5338e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #9: <unknown function> + 0x2f15a3e (0x7f6694791a3e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #10: <unknown function> + 0x2f15ac0 (0x7f6694791ac0 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #11: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)> const&, at::Tensor const&) const + 0xe7 (0x7f6692f17c57 in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #12: at::nonzero(at::Tensor const&) + 0x5e (0x7f6692d5338e in /home/my/nglod/sol-renderer/third-party/libtorch/lib/libtorch_cpu.so)
    frame #13: <unknown function> + 0x4222b (0x555f01cd522b in ./sdfRenderer)
    frame #14: <unknown function> + 0x27750 (0x555f01cba750 in ./sdfRenderer)
    frame #15: <unknown function> + 0x1819a (0x555f01cab19a in ./sdfRenderer)
    frame #16: <unknown function> + 0x20194 (0x7f67060ed194 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #17: fgEnumWindows + 0x39 (0x7f67060f0c39 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #18: glutMainLoopEvent + 0x1cd (0x7f67060ed7bd in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #19: glutMainLoop + 0x65 (0x7f67060edff5 in /lib/x86_64-linux-gnu/libglut.so.3)
    frame #20: <unknown function> + 0x18edc (0x555f01cabedc in ./sdfRenderer)
    frame #21: __libc_start_main + 0xf3 (0x7f6617f1a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #22: <unknown function> + 0x1639e (0x555f01ca939e in ./sdfRenderer)
    
    Aborted (core dumped)

tovacinni (Collaborator) commented May 21, 2021

Thanks for your interest in our work!

What version of libtorch are you using? The code was tested on 1.7.1, and using a newer version may cause issues (but I haven't actually tried).

heiwang1997 (Author) commented May 21, 2021

I was using 1.8.1. But just now I tried 1.7.1, which can be downloaded from here, but still no luck -- the error is the same 🤔

I saw in requirements.txt that the Python renderer expects PyTorch 1.6. Do the versions of libtorch and PyTorch have to match?

tovacinni (Collaborator) commented

Thanks for trying that out. If you can share the NPZ file you generated (via Google Drive or something similar), I can run it on my side and try to reproduce the error.

The Python PyTorch version shouldn't matter in theory, since it uses NPZ to bridge between the two and the C++ version uses its own separate PyTorch (libtorch).
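
Since the NPZ file is the only thing passed between the two sides, one way to compare a failing export against a working one is to list the arrays each file contains. The following is a minimal sketch, not part of the nglod code: it reads only the `.npy` headers inside the archive using the Python standard library, and the helper names `read_npy_header` and `summarize_npz` are made up for illustration.

```python
import ast
import struct
import zipfile


def read_npy_header(stream):
    """Return (shape, dtype_descr, fortran_order) from an open .npy stream."""
    if stream.read(6) != b"\x93NUMPY":
        raise ValueError("not a .npy member")
    major, _minor = stream.read(2)
    # Format 1.0 stores the header length in 2 bytes; 2.0 and 3.0 use 4 bytes.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    else:
        (header_len,) = struct.unpack("<I", stream.read(4))
    # The header is a Python dict literal padded with spaces and a newline.
    header = ast.literal_eval(stream.read(header_len).decode("latin1"))
    return header["shape"], header["descr"], header["fortran_order"]


def summarize_npz(path):
    """Map each array name in the archive to its (shape, dtype_descr)."""
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as stream:
                shape, descr, _ = read_npy_header(stream)
            summary[member[: -len(".npy")]] = (shape, descr)
    return summary


if __name__ == "__main__":
    import sys

    # Usage (path is whatever NPZ you exported): python summarize_npz.py armadillo.npz
    for npz_path in sys.argv[1:]:
        for name, (shape, descr) in sorted(summarize_npz(npz_path).items()):
            print(f"{npz_path}: {name} shape={shape} dtype={descr}")
```

Running this on a working NPZ and on a failing one and comparing the names, shapes, and dtypes may show whether the export differs, though it does not check the array values themselves.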

heiwang1997 (Author) commented

Thanks for the fast response! Here is the npz file: https://drive.google.com/file/d/1EcGrddM3kS_IbVVuS8_3zvja6PCswv1i/view?usp=sharing

tovacinni (Collaborator) commented

I just tried the NPZ and got the same error, but the renderer still works on the NPZs I have. There might be an issue with the NPZ export in the released code, so I'll take a deeper look at this later today.

heiwang1997 (Author) commented

Cool! Thanks for your help. Looking forward to your reply.

sixftninja commented

Hi @heiwang1997,

Did you try upgrading PyTorch? I was trying to run nglod on an A4000 GPU and found that PyTorch 1.6 does not support the Ampere architecture. Upgrading to the latest PyTorch worked.
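
To check whether an installed PyTorch build covers a given GPU, one option is to compare the device's compute capability with the architectures the build was compiled for. This is a rough sketch under a few assumptions: `device_is_covered` is a hypothetical helper, the matching rule is a simplification of CUDA's binary and PTX compatibility rules, and `torch.cuda.get_arch_list()` is only available in recent PyTorch releases.

```python
def device_is_covered(capability, arch_list):
    """Return True if a build compiled for `arch_list` can target `capability`.

    capability: (major, minor) tuple, e.g. (8, 6) for the "Ampere" device in the log.
    arch_list:  strings such as "sm_70" or "compute_75".
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, suffix = arch.partition("_")
        digits = "".join(ch for ch in suffix if ch.isdigit())
        if len(digits) < 2:
            continue  # skip entries this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm":
            # Compiled binaries run on the same major version with an equal or newer minor.
            if arch_major == major and arch_minor <= minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer architecture.
            if (arch_major, arch_minor) <= (major, minor):
                return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print(capability, arch_list, device_is_covered(capability, arch_list))
```

For example, with an illustrative (not actual) arch list, `device_is_covered((8, 6), ["sm_60", "sm_70", "sm_75"])` returns `False`, while `device_is_covered((8, 6), ["sm_80"])` returns `True`.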

Sylva-Lin commented

Hi @heiwang1997,
I also ran into these errors. How did you solve this in the end?
