
LightGBM GPU not working with CUDA 10.0 on RHEL 7.x #2075

Closed
nikolayvoronchikhin opened this issue Apr 2, 2019 · 4 comments
@nikolayvoronchikhin

Environment info

Operating System:
RHEL 7.5/7.6

CPU/GPU model:
NVIDIA Tesla P100-PCIE-16GB

C++/Python/R version:
Python 2.7 & Python 3.6
Microsoft R Open 3.4.3

LightGBM version or commit hash:
lightgbm==2.2.4

Error message for the LightGBM binary

[~]$ ./lightgbm config=lightgbm_gpu.conf data=higgs.train valid=higgs.test objective=binary metric=auc
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading data in 13.491848 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 5564616, number of negative: 4935384
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1535
[LightGBM] [Info] Number of data: 10500000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: Tesla P100-PCIE-16GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
Segmentation fault
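For context, the `lightgbm_gpu.conf` referenced in the command above isn't included in the issue. Modeled on the LightGBM GPU tutorial's Higgs benchmark config (note `max_bin = 63`, which matches the "64 bins" in the log), it presumably looks something like this; this is an assumption, not the reporter's actual file:

```
# hypothetical lightgbm_gpu.conf, modeled on the LightGBM GPU tutorial
task = train
max_bin = 63
num_leaves = 255
num_iterations = 50
learning_rate = 0.1
tree_learner = serial
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
```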

Error message for LightGBM in Python 3.6/2.7

[~]$ python36
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import train_test_split
/apps/dslab/anaconda/python3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
>>> from sklearn.metrics import accuracy_score
>>> import lightgbm as lgb
>>> # Loading the dataset
>>> iris = load_iris()
>>> X = iris.data
>>> y = iris.target
>>> # print(y)
>>> train_data = lgb.Dataset(X, label=y)
>>> params = {
...     'objective': 'multiclass',
...     'feature_fraction': 1,
...     'bagging_fraction': 1,
...     'num_class': 3,
...     'verbose': -1,
...     'device': 'gpu'
... }
>>> gbm = lgb.train(params, train_data, num_boost_round=10)
Segmentation fault
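As an aside, the DeprecationWarning in the session above comes from `sklearn.cross_validation`, which was removed in scikit-learn 0.20. A sketch of the replacement import, assuming scikit-learn >= 0.18 is installed (guarded so it degrades gracefully where scikit-learn is absent):

```python
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# sklearn.model_selection is the replacement location for train_test_split.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    train_test_split = None  # scikit-learn not available in this environment
```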

Success for LightGBM in R

The following issue helped make lightGBM GPU work in RStudio:
#964

Steps to reproduce

  1. After installing the NVIDIA driver and CUDA 10.0, I made sure /usr/local/cuda points to /usr/local/cuda-10.0.
  2. I followed the steps mentioned in this issue: Build w/ GPU but got CPU-only ver install in python (Ubuntu 16.04) #715, but that still results in a segmentation fault for me.

Can you suggest any other changes needed or is CUDA 10.0 not supported yet?

@nikolayvoronchikhin nikolayvoronchikhin changed the title LightGBM GPU not working with CUDA 10.0 LightGBM GPU not working with CUDA 10.0 on RHEL 7.x Apr 9, 2019
@StrikerRUS
Collaborator

or is CUDA 10.0 not supported yet?

We have a case of successful compilation with Boost 1.69.0 and CUDA 10.0: #2081 (comment), but that was Windows...

ping @huanzhang12

@nikolayvoronchikhin
Author

Thanks @StrikerRUS.
Hi @huanzhang12, can you help with this issue? Do you need any other details?

@nikolayvoronchikhin
Author

@huanzhang12, it actually works already!
I read your example here: https://github.com/huanzhang12/lightgbm-gpu
I just needed to use these two params instead: 'gpu_platform_id': 0, 'gpu_device_id': 0

[~]$ python36
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import train_test_split
/apps/dslab/anaconda/python3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
>>> from sklearn.metrics import accuracy_score
>>> import lightgbm as lgb
>>> # Loading the dataset
>>> iris = load_iris()
>>> X = iris.data
>>> y = iris.target
>>> train_data = lgb.Dataset(X, label=y)
>>> params = {
...     'objective': 'multiclass',
...     'feature_fraction': 1,
...     'bagging_fraction': 1,
...     'num_class': 3,
...     'verbose': -1,
...     'gpu_platform_id': 0,
...     'gpu_device_id': 0
... }
>>> gbm = lgb.train(params, train_data, num_boost_round=10)

@StrikerRUS
Collaborator

@nikolayvoronchikhin Glad that your problem has been solved! And thanks a lot for sharing your workaround here.
You can read more recent content about these params here: https://lightgbm.readthedocs.io/en/latest/GPU-Targets.html#query-opencl-devices-in-your-system
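Following the docs link above, querying the available OpenCL platforms and devices from Python can be sketched with pyopencl; this assumes pyopencl is installed (the `clinfo` command-line tool is an alternative) and falls back gracefully if it is not:

```python
# Sketch: enumerate OpenCL platforms and devices so you can pick valid
# gpu_platform_id / gpu_device_id values for LightGBM.
def list_opencl_devices():
    """Return a list of (platform_id, device_id, device_name) tuples."""
    try:
        import pyopencl as cl
        platforms = cl.get_platforms()
    except Exception:
        return []  # pyopencl missing or no OpenCL runtime; try `clinfo` instead
    devices = []
    for p_id, platform in enumerate(platforms):
        for d_id, device in enumerate(platform.get_devices()):
            devices.append((p_id, d_id, device.name))
    return devices

if __name__ == "__main__":
    for p_id, d_id, name in list_opencl_devices():
        print(f"gpu_platform_id={p_id} gpu_device_id={d_id}  {name}")
```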

@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020