
error:: an illegal memory access was encountered #19

Open
lin1061991611 opened this issue Jan 4, 2021 · 11 comments

Comments


lin1061991611 commented Jan 4, 2021

Hi, when I run this code with my own dataset, it fails with the following error:

Use config:
{'CONST': {'DEVICE': '0', 'NUM_WORKERS': 0, 'N_INPUT_POINTS': 2048},
 'DATASET': {'TEST_DATASET': 'Completion3D', 'TRAIN_DATASET': 'Completion3D'},
 'DATASETS': {'COMPLETION3D': {'CATEGORY_FILE_PATH': './datasets/Completion3D.json',
                               'COMPLETE_POINTS_PATH': '/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/datasets/Completion3D/%s/gt/%s/%s.h5',
                               'PARTIAL_POINTS_PATH': '/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/datasets/Completion3D/%s/partial/%s/%s.h5'},
              'KITTI': {'BOUNDING_BOX_FILE_PATH': '/home/SENSETIME/xiehaozhe/Datasets/KITTI/bboxes/%s.txt',
                        'CATEGORY_FILE_PATH': './datasets/KITTI.json',
                        'PARTIAL_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/KITTI/cars/%s.pcd'},
              'SHAPENET': {'CATEGORY_FILE_PATH': './datasets/ShapeNet.json',
                           'COMPLETE_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/ShapeNet/ShapeNetCompletion/%s/complete/%s/%s.pcd',
                           'N_POINTS': 16384,
                           'N_RENDERINGS': 8,
                           'PARTIAL_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/ShapeNet/ShapeNetCompletion/%s/partial/%s/%s/%02d.pcd'}},
 'DIR': {'OUT_PATH': './output'},
 'MEMCACHED': {'CLIENT_CONFIG': '/mnt/lustre/share/memcached_client/client.conf',
               'ENABLED': False,
               'LIBRARY_PATH': '/mnt/lustre/share/pymc/py3',
               'SERVER_CONFIG': '/mnt/lustre/share/memcached_client/server_list.conf'},
 'NETWORK': {'GRIDDING_LOSS_ALPHAS': [0.1],
             'GRIDDING_LOSS_SCALES': [128],
             'N_SAMPLING_POINTS': 2048},
 'TEST': {'METRIC_NAME': 'ChamferDistance'},
 'TRAIN': {'BATCH_SIZE': 1,
           'BETAS': [0.9, 0.999],
           'GAMMA': 0.5,
           'LEARNING_RATE': 0.0001,
           'LR_MILESTONES': [50],
           'N_EPOCHS': 150,
           'SAVE_FREQ': 25,
           'WEIGHT_DECAY': 0}}
[INFO] 2021-01-04 11:01:13,525 Collecting files of Taxonomy [ID=all, Name=Uncategorized Test Set]
[INFO] 2021-01-04 11:01:13,529 Collecting files of Taxonomy [ID=02691156, Name=classic]
[INFO] 2021-01-04 11:01:13,532 Collecting files of Taxonomy [ID=02933112, Name=other]
[INFO] 2021-01-04 11:01:13,533 Complete collecting files of the dataset. Total files: 104
[INFO] 2021-01-04 11:01:13,534 Collecting files of Taxonomy [ID=all, Name=Uncategorized Test Set]
[INFO] 2021-01-04 11:01:13,535 Collecting files of Taxonomy [ID=02691156, Name=classic]
[INFO] 2021-01-04 11:01:13,537 Collecting files of Taxonomy [ID=02933112, Name=other]
[INFO] 2021-01-04 11:01:13,538 Complete collecting files of the dataset. Total files: 14
[DEBUG] 2021-01-04 11:01:14,724 Parameters in GRNet: 76707626.
[INFO] 2021-01-04 11:01:19,336 [Epoch 1/150][Batch 1/104] BatchTime = 0.697 (s) DataTime = 0.049 (s) Losses = ['535.0128', '533.6913']
[INFO] 2021-01-04 11:01:19,450 [Epoch 1/150][Batch 2/104] BatchTime = 0.114 (s) DataTime = 0.020 (s) Losses = ['581.4405', '579.0204']
[INFO] 2021-01-04 11:01:19,598 [Epoch 1/150][Batch 3/104] BatchTime = 0.147 (s) DataTime = 0.055 (s) Losses = ['758.7496', '758.8049']
[INFO] 2021-01-04 11:01:19,695 [Epoch 1/150][Batch 4/104] BatchTime = 0.098 (s) DataTime = 0.006 (s) Losses = ['695.8061', '692.7615']
[INFO] 2021-01-04 11:01:19,793 [Epoch 1/150][Batch 5/104] BatchTime = 0.097 (s) DataTime = 0.006 (s) Losses = ['554.6122', '544.2510']
[INFO] 2021-01-04 11:01:19,931 [Epoch 1/150][Batch 6/104] BatchTime = 0.138 (s) DataTime = 0.044 (s) Losses = ['539.5575', '530.9702']
[INFO] 2021-01-04 11:01:20,071 [Epoch 1/150][Batch 7/104] BatchTime = 0.141 (s) DataTime = 0.048 (s) Losses = ['556.2327', '553.9484']
[INFO] 2021-01-04 11:01:20,171 [Epoch 1/150][Batch 8/104] BatchTime = 0.099 (s) DataTime = 0.006 (s) Losses = ['682.0801', '675.5230']
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=77 : an illegal memory access was encountered

Have you encountered this issue before? Thanks in advance.

lin1061991611 (Author) commented Jan 4, 2021

The full error report is:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/yaogan_504/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1448, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/yaogan_504/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/runner.py", line 76, in <module>
    main()
  File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/runner.py", line 58, in main
    train_net(cfg)
  File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/core/train.py", line 112, in train_net
    sparse_ptcloud, dense_ptcloud = grnet(data)
  File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/models/grnet.py", line 138, in forward
    sparse_cloud = self.point_sampling(sparse_cloud, partial_cloud)
  File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/models/grnet.py", line 21, in forward
    pred_cloud = torch.cat([partial_cloud, pred_cloud], dim=1)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

hzxie (Owner) commented Jan 4, 2021

I've never encountered this issue before, and no other users have reported anything similar.
Does it work with the ShapeNet dataset?

@yjcaimeow

Hi @hzxie,
I also hit the same error: RuntimeError: CUDA error: an illegal memory access was encountered.

Here is something that may be helpful; perhaps something in the feature sampling is wrong?

[screenshot of the failing code attached]

It also reports that the "point features" contain NaN (the check prints True):
tensor(True, device='cuda:0') =================point features after feature sample=================

Looking forward to your reply.
Best

hzxie (Owner) commented Apr 6, 2021

@yjcaimeow
Maybe the coordinates of the points in sparse_cloud are outside the range (-1, 1).
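
Something like the following check before the extension call could confirm that (a minimal sketch; assert_in_unit_cube is just an illustrative helper, not part of GRNet):

import torch

def assert_in_unit_cube(ptcloud: torch.Tensor, name: str = "ptcloud") -> None:
    # ptcloud: (B, N, 3) xyz coordinates expected to lie strictly inside (-1, 1)
    lo, hi = ptcloud.min().item(), ptcloud.max().item()
    if lo <= -1.0 or hi >= 1.0:
        raise ValueError("%s out of range: min=%.4f, max=%.4f" % (name, lo, hi))
    if not torch.isfinite(ptcloud).all():
        raise ValueError("%s contains NaN/Inf values" % name)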

@yjcaimeow

@hzxie

I checked the range of sparse_cloud and it is within [-1, 1]. The sparse cloud comes from the gridding_rev operation.

I also tried scaling it by 0.5, but the same error still occurs. :(

@leonardozcm

Hi @lin1061991611, maybe your extensions were compiled against a CUDA or cuDNN version that does not match your runtime? I once ran into this issue with a TensorFlow model.
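
To compare the runtime against the toolkit the extensions were built with, a quick check like this can help (a small sketch; compare the printed versions with the nvcc --version used to build the extensions):

import torch

print("PyTorch:", torch.__version__)                    # PyTorch build
print("CUDA (PyTorch built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))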

@alexzhou907

Has anyone solved this issue? I'm using pytorch=1.4.0 and cuda=10.1 with a fresh conda environment and am still getting this error.

@NorthSummer

has anyone solved this issue? I'm using pytorch=1.4.0, cuda=10.1 with a fresh conda environment. Still getting this error.

I think this is caused by the "gridding" extension: its output tensors cannot be read at all; printing them, or moving them to a specific GPU device, fails.
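
That behavior is consistent with how CUDA reports errors: kernels launch asynchronously, so the illegal access often only surfaces when the output tensor is first touched (e.g. by print or .cpu()). Running with synchronous launches usually points at the kernel that actually faulted (a general CUDA debugging sketch, not GRNet-specific):

import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call);
# alternatively, export CUDA_LAUNCH_BLOCKING=1 in the shell before running.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# ... run the model as usual; the traceback now stops at the kernel that actually faulted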

@cckamiya

Has anyone solved this issue? I'm using pytorch=1.8.1+cu111 with a fresh conda environment. Still getting this error.

@cckamiya

I found the cause of my problem: the training data was not normalized.
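
For anyone hitting the same thing, a minimal normalization along these lines (my own sketch, not the GRNet data pipeline) keeps the coordinates strictly inside (-1, 1):

import numpy as np

def normalize_ptcloud(points: np.ndarray) -> np.ndarray:
    # points: (N, 3) xyz coordinates in arbitrary units
    center = (points.max(axis=0) + points.min(axis=0)) / 2.0
    points = points - center                 # center the cloud at the origin
    scale = np.abs(points).max() + 1e-8      # largest absolute coordinate
    return points / scale * 0.999            # keep strictly inside (-1, 1)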


Tian0906 commented Jul 10, 2024

I am also running into the same problem. When I use my own dataset for inference, the program stops here:

class GriddingFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scale, ptcloud):
        # Voxelize the point cloud into a scale^3 grid spanning [-scale, scale - 1] on each axis
        grid, grid_pt_weights, grid_pt_indexes = gridding.forward(-scale, scale - 1, -scale, scale - 1, -scale,
                                                                  scale - 1, ptcloud)

The original input to GriddingFunction (i.e., gridding.forward(...)) is:

partial_cloud.size: torch.Size([1, 2048, 3])
partial_cloud coordinate: tensor([[[ 95.9131, -43.8667, 407.3000],
[ 6.2308, -5.8511, 400.2000],
[ 93.6227, -41.6913, 406.4000],
...,
[ 40.1768, 4.3794, 399.9000],
[ 76.5817, -5.4377, 399.9000],
[ 56.0437, -37.4217, 396.4000]]], device='cuda:0')

But after reading the replies above, I thought it might be caused by the missing normalization. So I normalized partial_cloud and fed it into GriddingFunction (i.e., gridding.forward(...)), but it still did not work.

normalized_partial_cloud: tensor([[[ 0.8058, -0.2408, -0.2035],
[-0.9362, 0.6257, -0.6163],
[ 0.7613, -0.1912, -0.2558],
...,
[-0.2768, 0.8588, -0.6337],
[ 0.4303, 0.6351, -0.6337],
[ 0.0314, -0.0939, -0.8372]]], device='cuda:0')

Then I tried to print the related information:

        print("grid: ", grid)
        print("grid_pt_weights: ", grid_pt_weights)
        print("grid_pt_indexes: ", grid_pt_indexes)

But before anything is printed, the program reports an error:

torch.Size([1, 262144])
torch.Size([1, 2048, 8, 3])
torch.Size([1, 2048, 8])
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCReduceAll.cuh line=327 error=700 : an illegal memory access was encountered
grid: Traceback (most recent call last):
  File "runner.py", line 77, in <module>
    main()
  File "runner.py", line 68, in main
    inference_net(cfg)
  File "/root/autodl-tmp/GRNet/core/inference.py", line 59, in inference_net
    sparse_ptcloud, dense_ptcloud = grnet(data)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/GRNet/models/grnet.py", line 125, in forward
    pt_features_64_l = self.gridding(normalized_partial_cloud).view(-1, 1, 64, 64, 64)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/GRNet/extensions/gridding/__init__.py", line 52, in forward
    grids.append(GriddingFunction.apply(self.scale, p))
  File "/root/autodl-tmp/GRNet/extensions/gridding/__init__.py", line 23, in forward
    print("grid: ", grid)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/tensor.py", line 162, in __repr__
    return torch._tensor_str._str(self)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 315, in _str
    tensor_str = _tensor_str(self, indent)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 213, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 88, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

How can I fix this? Please advise, thank you.
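
For completeness, the check I am adding right before the gridding.forward call while debugging looks roughly like this (only a sketch; the assertions encode my own assumptions about what the extension expects):

import torch

def preflight(ptcloud):
    # ptcloud: (B, N, 3) CUDA tensor about to be passed to gridding.forward
    assert ptcloud.is_cuda and ptcloud.dtype == torch.float32, "expected a float32 CUDA tensor"
    assert ptcloud.is_contiguous(), "the extension likely assumes contiguous memory"
    assert torch.isfinite(ptcloud).all(), "NaN/Inf in the input cloud"
    mn, mx = ptcloud.min().item(), ptcloud.max().item()
    assert -1.0 < mn and mx < 1.0, "coordinates out of (-1, 1): [%f, %f]" % (mn, mx)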
