GPUassert: invalid device symbol #5

Closed
Shmuma opened this issue Aug 2, 2019 · 5 comments

Comments

@Shmuma

Shmuma commented Aug 2, 2019

Hi!

I'm trying to get CuLE working, but after setting it up, this script fails with the message:

GPUassert: invalid device symbol /home/shmuma/work/tmp/cule/cule/atari/cuda/tables.hpp 43

Script:

import torch
from torchcule.atari import Env


if __name__ == "__main__":
    e = Env('PongNoFrameskip-v4', 2, color_mode='gray',
            device=torch.device('cuda', 0), rescale=True, clip_rewards=True,
            episodic_life=True, repeat_prob=0.0)
    obs = e.reset(initial_steps=4000, verbose=False)
    print(obs)

Environment: CUDA 10.0, PyTorch 1.1.0, driver 410.79, Python 3.7.

@ifrosio
Contributor

ifrosio commented Aug 5, 2019

How many GPUs do you have on your machine and which one is GPU-0 (the one you are using since you are passing ('cuda', 0) as device)?

@KyunghyunLee

I got the exact same error:

python ./examples/ppo/ppo_main.py --use-cuda-env --use-openai-test-env  --gpu 0
{'ale_start_steps': 400,
 'alpha': 0.99,
 'batch_size': 256,
 'clip_epsilon': 0.1,
 'conf_file': None,
 'entropy_coef': 0.01,
 'env_name': 'PongNoFrameskip-v4',
 'episodic_life': False,
 'eps': 1e-05,
 'evaluation_episodes': 10,
 'evaluation_interval': 1000000,
 'gamma': 0.99,
 'gpu': 0,
 'local_rank': 0,
 'log_dir': 'runs',
 'loss_scale': None,
 'lr': 0.00065,
 'lr_scale': False,
 'max_episode_length': 18000,
 'max_grad_norm': 0.5,
 'multiprocessing_distributed': False,
 'no_cuda_train': False,
 'normalize': False,
 'num_ales': 16,
 'num_gpus_per_node': -1,
 'num_stack': 4,
 'num_steps': 5,
 'opt_level': 'O0',
 'output_filename': None,
 'plot': False,
 'ppo_epoch': 3,
 'profile': False,
 'save_interval': 0,
 'seed': 1565661279,
 't_max': 50000000,
 'tau': 1.0,
 'use_adam': False,
 'use_cuda_env': True,
 'use_gae': False,
 'use_openai': False,
 'use_openai_test_env': True,
 'value_loss_coef': 0.5,
 'verbose': False}

PyTorch  : 1.1.0
CUDA     : 10.0.130
CUDNN    : 7501
APEX     : 0.1.0

GPUassert: invalid device symbol /home/lkh/Codes/cule/cule/atari/cuda/tables.hpp 43

Here is my nvidia-smi output:

Tue Aug 13 11:06:38 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 33%   57C    P0    65W / 250W |     12MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0  On |                  N/A |
|  0%   48C    P8    16W / 250W |    501MiB / 11177MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

@Shmuma
Author

Shmuma commented Aug 13, 2019 via email

@KyunghyunLee

I figured out the issue.
When I built torchcule, I got an error at line 11 of setup.py (https://github.com/NVlabs/cule/blob/master/setup.py#L11).
I changed it to "codes = ['70']", similar to line 14.
torchcule then built successfully, but I got the error message above at runtime.

I dug into tables.hpp and found that the code specifies the GPU architecture.
'70' actually means 'sm_70', which is for the Tesla V100.
Other codes are listed at the link below. For the 1080 Ti, it is '61'.
http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

I set the code to '61', cleaned the 'build' and 'dist' folders, then rebuilt torchcule.
It works great now.
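
For anyone hitting the same error, here is a minimal sketch of how the right code for a local GPU could be looked up instead of read off a table. It assumes PyTorch is installed and uses `torch.cuda.get_device_capability`; the helper names `arch_code` and `detect_arch_code` are hypothetical and are not part of CuLE or its setup.py.

```python
def arch_code(major, minor):
    """Turn a compute capability such as (6, 1) into the code string '61'."""
    return f"{major}{minor}"


def detect_arch_code(device_index=0):
    """Return the code for a local GPU, or None if PyTorch or CUDA is unavailable.

    Hypothetical helper for illustration only; not part of CuLE.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # get_device_capability returns a (major, minor) tuple, e.g. (6, 1)
    major, minor = torch.cuda.get_device_capability(device_index)
    return arch_code(major, minor)


if __name__ == "__main__":
    # Prints the code for GPU 0, or None when no CUDA device is visible.
    print(detect_arch_code(0))
```

The value it prints is what would go into the `codes` list in setup.py before cleaning and rebuilding.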

@ifrosio
Contributor

ifrosio commented Aug 20, 2019

Thanks. We are modifying the code to support multiple architectures, although this may increase compilation time. Will close when done.
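
As a rough illustration of what multi-architecture support usually involves, the sketch below builds one nvcc `-gencode` pair per architecture code. The function name `gencode_flags` is hypothetical, and this is not the change that was made to CuLE's setup.py; it only shows why each added architecture lengthens compilation, since nvcc generates device code once per entry.

```python
def gencode_flags(codes):
    """Build nvcc -gencode arguments for a list of codes such as ['61', '70'].

    Hypothetical helper for illustration; CuLE's actual build logic may differ.
    """
    flags = []
    for code in codes:
        # One -gencode pair per target architecture
        flags += ["-gencode", f"arch=compute_{code},code=sm_{code}"]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags(["61", "70"])))
```

A binary built with both entries should load on either a GTX 1080 Ti or a Tesla V100, at the cost of compiling the device code twice.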
