
demo examples not working, "terminated by signal SIGSEGV (Address boundary error)" #457

Closed
lweingart opened this issue Feb 3, 2021 · 9 comments
Labels: community/help wanted, Jetson

Comments

@lweingart commented Feb 3, 2021

Hello team,

I just finished installing mmpose on a Jetson AGX Xavier and decided to run the demos, to verify that everything was working well.

I first ran the example using images:

python3 demo/top_down_img_demo.py \
    configs/top_down/hrnet/coco/hrnet_w48_coco_256x192.py \
    https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w48_coco_256x192-b9e0b3ab_20200708.pth \
    --img-root tests/data/coco/ --json-file tests/data/coco/test_coco.json \
    --out-img-root vis_results

and had this result:
zsh: segmentation fault (core dumped) python3 demo/top_down_img_demo.py --video-path demo/demo_video.mp4
I re-ran it, and the second time it went through: I could see images vis_0.jpg to vis_3.jpg with the poses drawn on them.

The video demo refuses to work, though.

command:

python3 demo/bottom_up_video_demo.py \
    configs/bottom_up/hrnet/coco/hrnet_w32_coco_512x512.py \
    https://download.openmmlab.com/mmpose/bottom_up/hrnet_w32_coco_512x512-bcb8c247_20200816.pth \
    --video-path demo/demo_video.mp4 \
    --out-video-root vis_results

and the result is always:
zsh: segmentation fault (core dumped) python3 demo/bottom_up_video_demo.py --video-path demo/demo_video.mp4
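
One way to get more context on a crash like this, as a minimal sketch using only the standard library: enabling Python's faulthandler makes the interpreter dump its traceback when a native fault such as SIGSEGV arrives.

    # A hedged sketch: add these two lines near the top of the demo script so a
    # SIGSEGV prints the Python traceback of the crashing thread before exiting.
    import faulthandler
    faulthandler.enable()

The same effect is available without editing any files by running the demo as python3 -X faulthandler demo/bottom_up_video_demo.py ..., which at least reveals which Python call the crash happens under.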

Here is the result of the PYTHONPATH=${PWD}:$PYTHONPATH python mmpose/utils/collect_env.py command:

sys.platform: linux
Python: 3.6.9 (default, Oct  8 2020, 12:12:24) [GCC 8.4.0]
CUDA available: True
GPU 0: Xavier
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_62,code=sm_62;-gencode;arch=compute_72,code=sm_72
  - CuDNN 8.0
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -DMISSING_ARM_VST1 -DMISSING_ARM_VLD1 -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, USE_CUDA=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.0a0+45f960c
OpenCV: 4.4.0
MMCV: 1.2.6
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.2
MMPose: 0.11.0+d51e6e9

I installed PyTorch from the binaries given by nvidia here:
https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-7-0-now-available/72048

I installed everything using pip, or from source when pip didn't work (which was the case for quite a few of the requirements, by the way).

mmdetection is also installed

If you have any idea where to look, I would gladly take any hint.
Thank you very much for your help.

@innerlee (Contributor) commented Feb 4, 2021

Please try the image demo with mmdet to see if it can process images one by one: https://github.com/open-mmlab/mmpose/blob/master/demo/2d_human_pose_demo.md#using-mmdet-for-human-bounding-box-detection

@lweingart (Author) commented

Hello innerlee,

Unfortunately it does not appear to be successful.
Here is the command I ran:

python3 demo/top_down_img_demo_with_mmdet.py \
    demo/mmdetection_cfg/faster_rcnn_r50_fpn_1x_coco.py \
    http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
    configs/top_down/hrnet/coco/hrnet_w48_coco_256x192.py \
    https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w48_coco_256x192-b9e0b3ab_20200708.pth \
    --img-root tests/data/coco/ \
    --img 000000196141.jpg \
    --out-img-root vis_results

and here is the output:

/home/jetson/git/mmdetection/mmdet/models/builder.py:72: UserWarning: train_cfg and test_cfg is deprecated, please specify them in model
  'please specify them in model', UserWarning)
zsh: segmentation fault (core dumped)  python3 demo/top_down_img_demo_with_mmdet.py     --img-root tests/data/coco/

@lweingart (Author) commented

Hi again innerlee,

I installed mmpose on a Jetson Nano about two weeks ago, and on a Jetson AGX Xavier two days ago.
The segmentation faults occur on the Jetson Xavier; everything works fine on the Jetson Nano.

This command works fine on the Nano:

python3 demo/bottom_up_video_demo.py \
    configs/bottom_up/hrnet/coco/hrnet_w32_coco_512x512.py \
    https://download.openmmlab.com/mmpose/bottom_up/hrnet_w32_coco_512x512-bcb8c247_20200816.pth \
    --video-path demo/demo.mp4 \
    --out-video-root vis_results

but fails with a segmentation fault on the Xavier.

Here is the result of the PYTHONPATH=${PWD}:$PYTHONPATH python mmpose/utils/collect_env.py command on the Nano:

sys.platform: linux
Python: 3.6.9 (default, Oct  8 2020, 12:12:24) [GCC 8.4.0]
CUDA available: True
GPU 0: NVIDIA Tegra X1
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_62,code=sm_62;-gencode;arch=compute_72,code=sm_72
  - CuDNN 8.0
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -DMISSING_ARM_VST1 -DMISSING_ARM_VLD1 -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, USE_CUDA=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=0, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.0a0+45f960c
OpenCV: 4.4.0
MMCV: 1.2.2
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.2
MMPose: 0.9.0+4b00eeb

Should I maybe try to reinstall all the necessary components of mmpose?

@innerlee (Contributor) commented Feb 5, 2021

I have no experience with the Jetson Xavier :(
My uneducated guess is that it's a pytorch+nvidia thing. Could you successfully train or run inference with a classification model, such as resnet18, on it?
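
For reference, a minimal sketch of such a check (assuming torchvision is available; random weights are enough for a CUDA smoke test, so no pretrained download is needed):

    import torch
    import torchvision.models as models

    # Build resnet18 with random weights and move it to the GPU.
    model = models.resnet18().cuda().eval()

    # One forward pass on a random image-shaped batch; if CUDA kernels are
    # missing for this device, this is where it should fail.
    x = torch.randn(1, 3, 224, 224, device='cuda')
    with torch.no_grad():
        out = model(x)
    print(out.shape)  # expect torch.Size([1, 1000])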

innerlee added the Jetson label Feb 5, 2021
@lweingart (Author) commented

Could you successfully train or run inference with a classification model, such as resnet18, on it?

Do you mean theoretically, or is that a pointer to debug the problem?
All Jetson devices are designed for AI and deep learning, but I'm very new to these devices and haven't had time to train anything on them yet, to be honest.

However, the Jetson Nano, on which the mmpose installation is working, is far less powerful than the Xavier.

Jetson Nano:
GPU: 128-core NVIDIA Maxwell™ architecture-based GPU.
CPU: Quad-core ARM® A57.
Video: 4K @ 30 fps
Camera: MIPI CSI-2 DPHY lanes, 12x (Module) and 1x (Developer Kit)
Memory: 4 GB 64-bit LPDDR4; 25.6 gigabytes/second.

Jetson AGX Xavier:
GPU: NVIDIA Volta™ architecture with 512 NVIDIA CUDA cores and 64 Tensor cores; 11 TFLOPS (FP16), 22 TOPS (INT8)
DL Accelerator: 5 TFLOPS (FP16), 10 TOPS (INT8)
CPU: 8-core Carmel ARM v8.2 64-bit CPU, 8 MB L2 + 4 MB L3
Memory: 32 GB 256-bit LPDDR4x 2133 MHz; 136.5 GB/s
Storage: 32 GB eMMC 5.1
Camera: 16 lanes MIPI CSI-2, 8 lanes SLVS-EC; D-PHY (40 Gbps), C-PHY (109 Gbps)

Anyway, thank you for your availability. I will try to reinstall everything when I have time and will tell you whether it worked.
It's just that I will not have time to do so in the next few days.

I'll be back :-)

@lweingart (Author) commented

Hello,

So, I reinstalled everything on my Jetson Xavier from scratch (system, JetPack, torch, opencv, everything), and it still segfaults.
I didn't think about it at first, since I installed everything on my Jetson Nano and my Jetson Xavier only a few weeks apart, but the versions have changed.

On the Nano, on which the installation works, I have
mmcv 1.2.2
mmdet 2.7.0
mmpose 0.9.0

while on the Xavier I have
mmcv 1.2.4
mmdet 2.9.0
mmpose 0.11.0.

I will try to install these older versions on the Xavier and see if that solves the segfault somehow.
To be continued...

@innerlee (Contributor) commented

Again, I have no experience with Jetson. We can try to isolate the error.

  1. See if the demo in mmdet runs: https://github.com/open-mmlab/mmdetection/tree/master/demo . If it runs successfully, then it's an mmpose bug. Otherwise,
  2. See if pytorch runs. Try the example at the bottom of https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html (a GPU variant is sketched after this list). If the pytorch example works properly, then
  3. Monitor the usage of CPU memory, swap, /dev/shm, etc., to see whether any of these resources run out for large models.
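
A sketch of step 2, adapted from the Conv2d docs but with the module and input moved to the GPU so it exercises the CUDA path rather than the CPU one:

    import torch
    import torch.nn as nn

    # Conv2d example from the PyTorch docs, placed on the GPU; with a broken
    # CUDA build this forward pass should fail here instead of deep inside
    # mmcv/mmdet code.
    m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2)).cuda()
    x = torch.randn(20, 16, 50, 100, device='cuda')
    print(m(x).shape)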

@lweingart (Author) commented

Hi @innerlee,

So, to have consistent results between my Jetson Xavier and my Jetson Nano, I installed on the Xavier the same versions as on the Nano, which is to say:
mmcv 1.2.2
mmdet 2.7.0
mmpose 0.9.0

So, I tried torch and it seems to be working just fine.
I tried to run the mmdet verification found here: https://mmdetection.readthedocs.io/en/latest/get_started.html#verification
and I get a different result than on the Nano (where it runs perfectly well).
It seems to fail on the Xavier because of CUDA somehow.

>>> from mmdet.apis import init_detector, inference_detector
>>> config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
>>> checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
>>> device = 'cuda:0'
>>> model = init_detector(config_file, checkpoint_file, device=device)
>>> inference_detector(model, 'demo/demo.jpg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jetson/git/mmdetection/mmdet/apis/inference.py", line 123, in inference_detector
    result = model(return_loss=False, rescale=True, **data)[0]
  File "/home/jetson/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jetson/git/mmcv/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/jetson/git/mmdetection/mmdet/models/detectors/base.py", line 182, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/jetson/git/mmdetection/mmdet/models/detectors/base.py", line 159, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/jetson/git/mmdetection/mmdet/models/detectors/two_stage.py", line 194, in simple_test
    proposal_list = self.rpn_head.simple_test_rpn(x, img_metas)
  File "/home/jetson/git/mmdetection/mmdet/models/dense_heads/rpn_test_mixin.py", line 36, in simple_test_rpn
    proposal_list = self.get_bboxes(*rpn_outs, img_metas)
  File "/home/jetson/git/mmcv/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/home/jetson/git/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 571, in get_bboxes
    scale_factor, cfg, rescale)
  File "/home/jetson/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 167, in _get_bboxes_single
    dets, keep = batched_nms(proposals, scores, ids, nms_cfg)
  File "/home/jetson/git/mmcv/mmcv/ops/nms.py", line 289, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/home/jetson/git/mmcv/mmcv/utils/misc.py", line 310, in new_func
    output = old_func(*args, **kwargs)
  File "/home/jetson/git/mmcv/mmcv/ops/nms.py", line 148, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset)
  File "/home/jetson/git/mmcv/mmcv/ops/nms.py", line 19, in forward
    bboxes, scores, iou_threshold=float(iou_threshold), offset=offset)
RuntimeError: CUDA error: no kernel image is available for execution on the device

What bothers me is that when I run the checks in cuda/samples like deviceQuery or bandwidthTest, everything seems to be working fine.

That makes me think pytorch may be the cause of the problem too, but every test I can find to verify the pytorch installation passes successfully.
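
One more cross-check worth doing here, as a minimal sketch: compare the GPU's compute capability with the CUDA versions that PyTorch and mmcv were actually built against, since "no kernel image is available" usually means the binary lacks kernels for the device's architecture (the Xavier is sm_72, the Nano sm_53):

    import torch
    # get_compiler_version/get_compiling_cuda_version are mmcv's own build-info
    # helpers, used in its official installation check.
    from mmcv.ops import get_compiler_version, get_compiling_cuda_version

    print(torch.cuda.get_device_name(0))        # e.g. 'Xavier'
    print(torch.cuda.get_device_capability(0))  # Xavier should report (7, 2)
    print(torch.version.cuda)                   # CUDA version PyTorch was built with
    print(get_compiler_version())               # compiler that built the mmcv ops
    print(get_compiling_cuda_version())         # CUDA version the mmcv ops were built with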

My last attempt: building pytorch from source.

...

I forgot to click the comment button, and this is now a few days old.
I built torch from source and installed it successfully, but I still get a segfault when trying to use mmpose on my Jetson Xavier.
I'm having difficulty understanding why it behaves differently from the Jetson Nano, and I'm out of options.

Anyway, thank you for your help.

@innerlee (Contributor) commented Mar 9, 2021

@lweingart Thanks for your additional input. From the stacktrace, it is the nms op that cannot run, so my bet is that the ops in mmcv were not compiled successfully. To verify this, you can run the testing code here: https://github.com/open-mmlab/mmcv/blob/48d990258549ca626fcf8c34488c00ed6fce108a/tests/test_ops/test_nms.py#L137-L155 . Note that the code tests the CPU version by default, so put all tensors on CUDA to trigger the GPU version of nms.
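
For convenience, a minimal sketch of that test with the tensors already on the GPU (the box and score values are illustrative, following the shapes used in the linked test):

    import torch
    from mmcv.ops import nms

    # A few overlapping (x1, y1, x2, y2) boxes with scores, on the GPU so this
    # calls the compiled CUDA nms kernel from mmcv.
    boxes = torch.tensor([[49.1, 32.4, 51.0, 35.9],
                          [49.3, 32.9, 51.0, 35.3],
                          [35.3, 11.5, 39.9, 14.5]], device='cuda')
    scores = torch.tensor([0.9, 0.8, 0.5], device='cuda')

    # "RuntimeError: CUDA error: no kernel image is available" here would
    # confirm that mmcv's ops were not compiled for this GPU architecture.
    dets, inds = nms(boxes, scores, iou_threshold=0.3)
    print(dets, inds)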

Tau-J closed this as completed Apr 24, 2023
HAOCHENYE added a commit to HAOCHENYE/mmpose that referenced this issue Jun 27, 2023