GPUassert: invalid device symbol #5

Shmuma · 2019-08-02T17:35:10Z

Hi!

I'm trying to get CuLE working, but after setting it up, this script fails with the message:

GPUassert: invalid device symbol /home/shmuma/work/tmp/cule/cule/atari/cuda/tables.hpp 43

Script:

import torch
from torchcule.atari import Env


if __name__ == "__main__":
    e = Env('PongNoFrameskip-v4', 2, color_mode='gray',
            device=torch.device('cuda', 0), rescale=True, clip_rewards=True,
            episodic_life=True, repeat_prob=0.0)
    obs = e.reset(initial_steps=4000, verbose=False)
    print(obs)

I have CUDA 10.0, PyTorch 1.1.0, driver 410.79, and Python 3.7.

ifrosio · 2019-08-05T23:54:56Z

How many GPUs do you have on your machine and which one is GPU-0 (the one you are using since you are passing ('cuda', 0) as device)?
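One way to answer this is to ask PyTorch which devices it sees and under which index. The following is a minimal sketch, assuming PyTorch is installed; the helper name `list_cuda_devices` is hypothetical and not part of CuLE. Note that the CUDA runtime's default device ordering can differ from the order shown by nvidia-smi unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
def list_cuda_devices():
    """Return (index, name) for each GPU PyTorch can see.

    Returns an empty list if PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [(i, torch.cuda.get_device_name(i))
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    # Index i here corresponds to torch.device('cuda', i).
    for index, name in list_cuda_devices():
        print(f"cuda:{index} -> {name}")
```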

KyunghyunLee · 2019-08-13T02:07:05Z

I got the exact same error:

python ./examples/ppo/ppo_main.py --use-cuda-env --use-openai-test-env  --gpu 0
{'ale_start_steps': 400,
 'alpha': 0.99,
 'batch_size': 256,
 'clip_epsilon': 0.1,
 'conf_file': None,
 'entropy_coef': 0.01,
 'env_name': 'PongNoFrameskip-v4',
 'episodic_life': False,
 'eps': 1e-05,
 'evaluation_episodes': 10,
 'evaluation_interval': 1000000,
 'gamma': 0.99,
 'gpu': 0,
 'local_rank': 0,
 'log_dir': 'runs',
 'loss_scale': None,
 'lr': 0.00065,
 'lr_scale': False,
 'max_episode_length': 18000,
 'max_grad_norm': 0.5,
 'multiprocessing_distributed': False,
 'no_cuda_train': False,
 'normalize': False,
 'num_ales': 16,
 'num_gpus_per_node': -1,
 'num_stack': 4,
 'num_steps': 5,
 'opt_level': 'O0',
 'output_filename': None,
 'plot': False,
 'ppo_epoch': 3,
 'profile': False,
 'save_interval': 0,
 'seed': 1565661279,
 't_max': 50000000,
 'tau': 1.0,
 'use_adam': False,
 'use_cuda_env': True,
 'use_gae': False,
 'use_openai': False,
 'use_openai_test_env': True,
 'value_loss_coef': 0.5,
 'verbose': False}

PyTorch  : 1.1.0
CUDA     : 10.0.130
CUDNN    : 7501
APEX     : 0.1.0

GPUassert: invalid device symbol /home/lkh/Codes/cule/cule/atari/cuda/tables.hpp 43

Here is my nvidia-smi output:

Tue Aug 13 11:06:38 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 33%   57C    P0    65W / 250W |     12MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0  On |                  N/A |
|  0%   48C    P8    16W / 250W |    501MiB / 11177MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

Shmuma · 2019-08-13T05:33:36Z

I have the same GPU configuration: two 1080 Ti cards.

-- wbr, Max Lapan

KyunghyunLee · 2019-08-16T14:26:44Z

I figured out the issue.
When I built torchcule, I got an error at line 11 of setup.py (https://github.com/NVlabs/cule/blob/master/setup.py#L11).
I changed it to "codes = ['70']", similar to line 14.
torchcule then built successfully, but I got the error message above.

I dug into tables.hpp and found that the code refers to the GPU architecture.
'70' actually means 'sm_70', which is for the Tesla V100.
Other codes are listed at the link below. For the 1080 Ti, it is '61'.
http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

I cleaned the 'build' and 'dist' folders, then rebuilt torchcule.
It works great now.
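To find the right code for a given card without consulting a lookup table, the compute capability can be queried directly. This is a minimal sketch, assuming PyTorch is installed; `arch_code` and `detect_arch_codes` are hypothetical helper names, not part of CuLE or its setup.py. The code is simply the major and minor compute capability digits joined together.

```python
def arch_code(major: int, minor: int) -> str:
    """Format a CUDA compute capability as the code used in sm_XX names."""
    return f"{major}{minor}"


def detect_arch_codes():
    """Return the arch code of every GPU PyTorch can see.

    Returns an empty list if PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [arch_code(*torch.cuda.get_device_capability(i))
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    print(arch_code(6, 1))       # GTX 1080 Ti has compute capability 6.1
    print(arch_code(7, 0))       # Tesla V100 has compute capability 7.0
    print(detect_arch_codes())   # codes for the GPUs on this machine, if any
```

On a machine with two 1080 Ti cards, the detected codes should both match the '61' value mentioned above; that value would then go into the `codes` list before rebuilding.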

ifrosio · 2019-08-20T01:18:32Z

Thanks. We are modifying the code to support multiple architectures, although this may require a longer compilation time. Will close when done.

sdalton1 closed this as completed in c5f1960 Aug 21, 2019
