question: Nvidia A100 cuda 11.8 which implementation to use? #6018

Closed
ssh352 opened this issue Aug 3, 2023 · 5 comments

ssh352 commented Aug 3, 2023

Hi, thanks for the great package. I have multiple A100s on the server. Do you recommend the OpenCL version or the CUDA version? Which is faster and has more features? I tried both, and both seem to work.

However, I'm having some trouble specifying the GPU ID with the CUDA version. Thanks!
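
For readers landing here with the same question, the following is a minimal sketch of how device selection is usually expressed through LightGBM parameters. It assumes the documented `device_type` values (`cpu`, `gpu` for the OpenCL build, `cuda` for the CUDA build) and the `gpu_device_id` parameter; the helper name `gpu_params` and the objective/metric defaults are illustrative and not taken from this thread.

```python
def gpu_params(device_type: str = "cuda", gpu_device_id: int = 0) -> dict:
    """Build a LightGBM parameter dict that pins training to one device.

    Hypothetical helper for illustration; only the parameter names
    (device_type, gpu_device_id) come from the LightGBM documentation.
    """
    if device_type not in {"cpu", "gpu", "cuda"}:
        raise ValueError(f"unsupported device_type: {device_type!r}")
    if gpu_device_id < 0:
        raise ValueError("gpu_device_id must be a non-negative index here")

    params = {
        "objective": "binary",   # assumption: binary task, as in the Higgs example
        "metric": "auc",
        "device_type": device_type,
    }
    # The device index only matters when a GPU backend is selected.
    if device_type != "cpu":
        params["gpu_device_id"] = gpu_device_id
    return params


params = gpu_params("cuda", 1)
print(params)
```

With a CUDA-enabled build, a dict like this would typically be passed to `lightgbm.train(params, train_set)`. The same keys can be written as `key = value` lines in a CLI config file.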

ssh352 changed the title from "question: Nvidia A100 cuda 11.8 which version to use?" to "question: Nvidia A100 cuda 11.8 which implementation to use?" on Aug 3, 2023
shiyu1994 (Collaborator) commented

Thanks for using LightGBM. Currently, neither version supports multi-GPU training. We'll add multi-GPU support to the CUDA version very soon.


ssh352 commented Aug 4, 2023

@shiyu1994 Thanks. When I run the Higgs example from the documentation (https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html#run-your-first-learning-task-on-gpu), the CUDA version core dumps if I set gpu_device_id to any value other than 0. Please advise.

❯ lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Construct bin mappers from text data time 7.77 seconds
[LightGBM] [Warning] Metric auc is not implemented in cuda version. Fall back to evaluation on CPU.
[LightGBM] [Info] Finished loading data in 23.403158 seconds
[LightGBM] [Info] Number of positive: 5564616, number of negative: 4935384
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 10500000, number of used features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529963 -> initscore=0.119997
[LightGBM] [Info] Start training from score 0.119997
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered /home/ssh352/LightGBM/src/io/cuda/cuda_tree.cpp 37

zsh: abort (core dumped)  lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test
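
A generic CUDA-level workaround for this kind of crash, not something proposed in this thread, is to hide all but one GPU from the process with the `CUDA_VISIBLE_DEVICES` environment variable. The chosen card then appears as device 0, so `gpu_device_id` can stay at 0. The sketch below only builds the command and environment. The helper name `build_cli_run` is hypothetical, and the sketch assumes that a `gpu_device_id=0` argument on the command line takes precedence over the config file.

```python
import os


def build_cli_run(config_path: str, gpu_index: int, extra_args=()):
    """Return (cmd, env) for a LightGBM CLI run pinned to one physical GPU.

    Hypothetical helper: it does not launch anything, it only prepares
    the arguments for subprocess.run.
    """
    env = dict(os.environ)
    # The CUDA runtime renumbers visible devices from 0, so the selected
    # physical GPU becomes device 0 inside the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = ["lightgbm", f"config={config_path}", "gpu_device_id=0", *extra_args]
    return cmd, env


cmd, env = build_cli_run(
    "lightgbm_gpu.conf",
    1,
    ["data=higgs.train", "valid=higgs.test", "objective=binary", "metric=auc"],
)
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(cmd))
```

To actually start the run, the pair would be passed to `subprocess.run(cmd, env=env, check=True)`, assuming the `lightgbm` binary is on `PATH`.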

shiyu1994 (Collaborator) commented

@ssh352 It has been fixed by #6028.


ssh352 commented Aug 15, 2023

@shiyu1994 It works now, impressive work!


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023