question: Nvidia A100 cuda 11.8 which implementation to use? #6018

ssh352 · 2023-08-03T08:20:23Z

Hi, thanks for the great package. I have multiple A100s on the server. Do you recommend the OpenCL version or the CUDA version? Which is faster and has more features? I tried both, and both seem to work.

However, I'm having some trouble specifying the GPU ID with the CUDA version. Thanks!

shiyu1994 · 2023-08-04T02:51:51Z

Thanks for using LightGBM. Currently, neither version supports multi-GPU training. We'll add multi-GPU support for the CUDA version very soon.

ssh352 · 2023-08-04T06:00:12Z

@shiyu1994 Thanks. When I run the Higgs example from the documentation (https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html#run-your-first-learning-task-on-gpu), the CUDA version core dumps if I set gpu_device_id to a number other than 0. Please advise.

❯ lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Construct bin mappers from text data time 7.77 seconds
[LightGBM] [Warning] Metric auc is not implemented in cuda version. Fall back to evaluation on CPU.
[LightGBM] [Info] Finished loading data in 23.403158 seconds
[LightGBM] [Info] Number of positive: 5564616, number of negative: 4935384
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 10500000, number of used features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529963 -> initscore=0.119997
[LightGBM] [Info] Start training from score 0.119997
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/io/cuda/cuda_tree.cpp 37

zsh: abort (core dumped)  lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test

shiyu1994 · 2023-08-13T15:14:29Z

@ssh352 It has been fixed by #6028.

ssh352 · 2023-08-15T01:47:48Z

@shiyu1994 It works now, impressive work!
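
For readers reproducing this, here is a minimal sketch of how the device selection discussed in this thread could be written into a config file for the CLI. It assumes a LightGBM build with CUDA support that includes the fix from #6028, and it uses the `device_type` and `gpu_device_id` parameters. The helper names `build_cuda_conf` and `build_command` are hypothetical and only for illustration, and the tutorial's other settings are omitted. The script only prints the config and command; it does not launch training.

```python
import shlex
from typing import List


def build_cuda_conf(gpu_device_id: int) -> str:
    """Return config-file text that selects the CUDA build and one GPU.

    Hypothetical helper: only the device-related keys are written here.
    """
    if gpu_device_id < 0:
        raise ValueError("gpu_device_id must be non-negative in this sketch")
    lines = [
        "task = train",
        "device_type = cuda",                  # CUDA implementation rather than OpenCL ("gpu")
        f"gpu_device_id = {gpu_device_id}",    # which GPU to train on
    ]
    return "\n".join(lines) + "\n"


def build_command(conf_path: str) -> List[str]:
    """Return the CLI invocation from the thread, pointing at conf_path."""
    return [
        "lightgbm",
        f"config={conf_path}",
        "data=higgs.train",
        "valid=higgs.test",
        "objective=binary",
        "metric=auc",
    ]


if __name__ == "__main__":
    # Show what would be written and run for the second GPU (index 1).
    print(build_cuda_conf(1), end="")
    print(" ".join(shlex.quote(arg) for arg in build_command("lightgbm_gpu.conf")))
```

Writing the output of `build_cuda_conf(1)` to `lightgbm_gpu.conf` and running the printed command corresponds to the non-zero `gpu_device_id` case reported above.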

github-actions · 2023-11-15T00:19:56Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

ssh352 changed the title ~~question: Nvidia A100 cuda 11.8 which version to use?~~ question: Nvidia A100 cuda 11.8 which implementation to use? Aug 3, 2023

jameslamb added the question label Aug 3, 2023

shiyu1994 mentioned this issue Aug 10, 2023

[CUDA] Set GPU device ID in threads #6028

Merged

jameslamb added the awaiting response label Aug 13, 2023

github-actions bot removed the awaiting response label Aug 15, 2023

jameslamb closed this as completed Aug 15, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
