THCudaCheck FAIL #46

Closed
tejas1995 opened this issue Jul 20, 2020 · 2 comments

@tejas1995

I am trying to train VL-BERT for RefCOCO+ (python refcoco/train_end2end.py --cfg cfgs/refcoco/base_detected_regions_4x16G.yaml). However, I am getting the following CUDA-related error.

THCudaCheck FAIL file=/project/ocean/tsriniva/VL-BERT/common/lib/roi_pooling/cuda/ROIAlign_cuda.cu line=297 error=98 : invalid device function
Traceback (most recent call last):
  File "refcoco/train_end2end.py", line 60, in <module>
    main()
  File "refcoco/train_end2end.py", line 54, in main
    rank, model = train_net(args, config)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../refcoco/function/train.py", line 323, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../refcoco/modules/resnet_vlbert_for_refcoco.py", line 96, in train_forward
    segms=None)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/fast_rcnn.py", line 149, in forward
    roi_align_res = self.roi_align(img_feats['body4'], rois).type(images.dtype)
  File "/home/tsriniva/anaconda2/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/lib/roi_pooling/roi_align.py", line 69, in forward
    input.float(), rois.float(), self.output_size, self.spatial_scale, self.sampling_ratio
  File "/project/ocean/tsriniva/VL-BERT/refcoco/../common/lib/roi_pooling/roi_align.py", line 20, in forward
    input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio
RuntimeError: cuda runtime error (98) : invalid device function at /project/ocean/tsriniva/VL-BERT/common/lib/roi_pooling/cuda/ROIAlign_cuda.cu:297
Segmentation fault (core dumped)

Is there any fix for this?

@jackroos
Owner

jackroos commented Jul 22, 2020

Could you provide the versions of your OS, CUDA, gcc, Python, and PyTorch? Thank you!
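
The requested details can be gathered from inside the same environment that runs training, which avoids reporting a version that a different shell or conda environment would show. The script below is a minimal sketch, not part of VL-BERT: the helper names `tool_output`, `pick_line`, and `collect_versions` are hypothetical, and it assumes `gcc` and `nvcc` are on `PATH` (missing tools are reported as `None`). PyTorch also ships `python -m torch.utils.collect_env`, which prints a similar report.

```python
import platform
import shutil
import subprocess


def tool_output(cmd):
    """Return the combined output of `cmd`, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


def pick_line(text, keyword=None):
    """Return the first line of `text`, or the first line containing `keyword`."""
    if not text:
        return None
    lines = text.splitlines()
    if keyword is None:
        return lines[0]
    return next((line for line in lines if keyword in line), None)


def collect_versions():
    """Collect OS, Python, compiler, and PyTorch versions into a dict."""
    info = {
        "system": platform.platform(),
        "python": platform.python_version(),
        "gcc": pick_line(tool_output(["gcc", "--version"])),
        # nvcc prints several lines; the one containing "release" holds the version.
        "nvcc": pick_line(tool_output(["nvcc", "--version"]), "release"),
    }
    try:
        import torch
    except ImportError:
        info["pytorch"] = None
        info["pytorch_cuda"] = None
    else:
        info["pytorch"] = torch.__version__
        # CUDA version PyTorch itself was built against (None for CPU-only builds).
        info["pytorch_cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print("{}: {}".format(name, value))
```

Comparing the `nvcc` line with `pytorch_cuda` shows whether the toolkit used to compile the extension matches the one PyTorch was built against.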

@tejas1995
Author

System version: Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-101-generic x86_64)
Cuda: 10.0
gcc: 7.5.0
python: 3.6.10
pytorch: 1.1.0
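
`invalid device function` is the message the CUDA runtime gives when a kernel does not exist or was not compiled for the proper device architecture, so one thing worth checking alongside the versions above is the compute capability of the GPUs in use. The following is a sketch under that assumption only; it does not confirm the cause of this particular failure, and `sm_tag` and `describe_gpus` are hypothetical helper names rather than anything in VL-BERT.

```python
def sm_tag(major, minor):
    """Format a compute capability such as (7, 5) as the nvcc-style tag 'sm_75'."""
    return "sm_{}{}".format(major, minor)


def describe_gpus():
    """Return (index, name, sm tag) for each visible GPU, or [] if none can be queried."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        gpus.append((index, torch.cuda.get_device_name(index), sm_tag(major, minor)))
    return gpus


if __name__ == "__main__":
    for index, name, tag in describe_gpus():
        print("GPU {}: {} ({})".format(index, name, tag))
```

If the tag printed here is not among the architectures the ROIAlign extension was compiled for, rebuilding the extension in the same environment that runs training is a reasonable next step.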
