Training on GPU fails (OSError: exception: access violation) #1717

Open
Mtale opened this issue Sep 30, 2018 · 16 comments

@Mtale

Mtale commented Sep 30, 2018

I have been trying to run LightGBM GPU for some time without success. The software works well on CPU.

I've compiled LightGBM using MinGW following the instructions here, and using MSVC as instructed here. I used Visual Studio 2017 to compile.

Whichever way I compile, when I try to train a model from Python in Jupyter I get the same error message:

OSError: exception: access violation reading 0x0000000000000020

More details on the error are below. The traceback is from the sklearn API, but the error is the same if I use the lightgbm.cv API.

When I tried to run the CLI example from the MinGW compilation instructions, the program failed silently. I have the MSVC build installed right now and can't reproduce it, but if you refer to the image in the instructions, the silent failure occurs after the line Total bins 6143.

output of CLI example

I've run TensorFlow on the GPU earlier, so the GPU does work. However, GPU Caps Viewer fails silently on startup. This is probably related, but I wasn't able to find anything about that problem online.

I've tried suggestions in the following issues:

#836
#1028

Environment info

Operating System: Windows 10 Home

CPU Model: i7 7700
GPU model: GeForce GTX 1060 6GB
CUDA: 9.0.176.2
OpenCL: 1.2

C++/Python/R version: Python 3.6

Error message

in model(features, test_features, encoding, n_folds)
125 eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
126 eval_names = ['valid', 'train'], categorical_feature = cat_indices,
--> 127 early_stopping_rounds = 100, verbose = 200)
128
129 # Record the best iteration

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
697 verbose=verbose, feature_name=feature_name,
698 categorical_feature=categorical_feature,
--> 699 callbacks=callbacks)
700 return self
701

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
500 verbose_eval=verbose, feature_name=feature_name,
501 categorical_feature=categorical_feature,
--> 502 callbacks=callbacks)
503
504 if evals_result:

C:\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
188 # construct booster
189 try:
--> 190 booster = Booster(params=params, train_set=train_set)
191 if is_valid_contain_train:
192 booster.set_train_data_name(train_data_name)

C:\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, silent)
1474 train_set.construct().handle,
1475 c_str(params_str),
-> 1476 ctypes.byref(self.handle)))
1477 # save reference to data
1478 self.train_set = train_set

OSError: exception: access violation reading 0x0000000000000020
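
A note on reading these messages: the text after "access violation" gives the operation (reading or writing) and the faulting address. The following is a minimal Python sketch for pulling both out of the message. The helper names `parse_access_violation` and `looks_like_null_offset` and the 0x1000 threshold are illustrative assumptions, not part of LightGBM. A very small address such as 0x20 is often what a field access through a null pointer looks like, but this is only a heuristic and does not by itself identify the cause here.

```python
import re

# Matches the wording used in the OSError messages quoted in this thread.
_PATTERN = re.compile(r"access violation (reading|writing) (0x[0-9A-Fa-f]+)")


def parse_access_violation(message):
    """Return (operation, address) from an access violation message, or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2), 16)


def looks_like_null_offset(address, limit=0x1000):
    """Heuristic (assumption): addresses below `limit` suggest a null pointer plus a small offset."""
    return address < limit


messages = [
    "OSError: exception: access violation reading 0x0000000000000020",
    "OSError: exception: access violation writing 0xFFFFFFFF95A80000",
]
for message in messages:
    operation, address = parse_access_violation(message)
    print(operation, hex(address), looks_like_null_offset(address))
```

Under this heuristic the 0x20 read and the large "writing" addresses reported elsewhere in the thread fall into different groups, so they may not share a single cause.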

@funkindy

funkindy commented Oct 11, 2018

Exactly the same problem here with GPU version built with MinGW:

Windows 10,
CMake 3.8,
Boost 1.63.0
CUDA: 9.2

Building goes fine, and the CLI also works okay on GPU with the test examples, but the Python wrapper fails with the same error (OSError: exception: access violation writing 0xFFFFFFFF95A80000) on Booster init.

I installed the Python wrapper with this command: python setup.py install --precompile

@LinYungLun

LinYungLun commented Oct 15, 2018

I got a similar error on my Windows 10 machine too; it works okay on GPU with the test examples.

Windows 10,
CMake 3.11,
Boost 1.63.0
CUDA: 9.2

Traceback (most recent call last):
File "", line 95, in
verbose_eval=300
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\engine.py", line 192, in train
booster = Booster(params=params, train_set=train_set)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 1487, in __init__
train_set.construct().handle,
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 985, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 771, in _lazy_init
self.__init_from_np2d(data, params_str, ref_dataset)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 835, in __init_from_np2d
ctypes.byref(self.handle)))
OSError: exception: access violation writing 0xFFFFFFFF94A00000

@guolinke
Collaborator

ping @huanzhang12

@huanzhang12
Contributor

@funkindy @marcualin7412 Could you check whether GPU Caps Viewer works on your system?
You can download it here: http://www.ozone3d.net/gpu_caps_viewer/
See if you can view OpenCL devices using it.

For debugging this kind of issue I suggest using the CLI version of LightGBM instead of Python. Could you please run LightGBM using the CLI (command line interface) and get a full output log? This will be really helpful for me to investigate this issue.
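
The CLI run suggested above could be scripted so that the full log ends up in a file that can be attached to the issue. This is a minimal sketch in Python; the helper names `build_cli_command` and `run_and_log` and the log file name are assumptions, and the example arguments are the ones used later in this thread with the binary classification example.

```python
import subprocess


def build_cli_command(exe, config, overrides):
    """Return the argument list for a LightGBM CLI run (key=value style arguments)."""
    args = [exe, "config={}".format(config)]
    args += ["{}={}".format(key, value) for key, value in overrides.items()]
    return args


def run_and_log(command, log_path, cwd=None):
    """Run the command, write combined stdout/stderr to log_path, and return the exit code."""
    result = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so the log keeps ordering
        universal_newlines=True,   # decode output as text
    )
    with open(log_path, "w", encoding="utf-8") as log_file:
        log_file.write(result.stdout)
    return result.returncode


command = build_cli_command(
    "../../lightgbm.exe",
    "train.conf",
    {"data": "binary.train", "valid": "binary.test", "objective": "binary", "device": "gpu"},
)
print(" ".join(command))
```

Calling `run_and_log(command, "gpu_cli_log.txt")` from the example directory would then write the log. A nonzero return code together with a log that stops early would suggest the crash happens in the native code rather than in the Python wrapper.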

@LinYungLun

Thanks for your quick response.
This is my OpenCL page:
[image]

@LinYungLun

@huanzhang12
Using the CLI works fine; it seems something is wrong in the Spyder IDE. Thanks for your time.

@funkindy

@huanzhang12 Attached is the output of this command: "../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu

The OpenCL page of GPU Caps Viewer looks okay, like @marcualin7412's.

GPU_testing.txt

The log looks good, so the issue may be specific to the Python interface.

@StrikerRUS
Collaborator

ping @huanzhang12

@StrikerRUS
Collaborator

gently ping @huanzhang12

@Mtale
Author

Mtale commented Feb 9, 2019

Any information on this issue yet?

@xins-yao

xins-yao commented Mar 2, 2019

@Mtale
I had the same issue as you before. After I disabled my other GPU (Intel), it now works fine. Maybe you could try it.
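
An alternative to disabling the second GPU might be to point LightGBM at a specific OpenCL device. LightGBM documents the `gpu_platform_id` and `gpu_device_id` parameters for this; the sketch below only builds such a parameter dictionary. The helper names and the index values are assumptions: the right indices depend on how the OpenCL platforms are enumerated on a given machine, and nobody in this thread has confirmed that this avoids the access violation.

```python
def gpu_params(platform_id, device_id, **extra):
    """Build LightGBM parameters that pin training to one OpenCL device."""
    params = {
        "device": "gpu",                 # use the OpenCL-based GPU tree learner
        "gpu_platform_id": platform_id,  # index of the OpenCL platform (e.g. Intel vs NVIDIA)
        "gpu_device_id": device_id,      # index of the device within that platform
    }
    params.update(extra)
    return params


def train_on_device(X, y, platform_id, device_id):
    """Hypothetical helper: train a small binary model on the chosen device."""
    import lightgbm as lgb  # imported lazily so gpu_params works without LightGBM installed

    train_set = lgb.Dataset(X, label=y)
    params = gpu_params(platform_id, device_id, objective="binary")
    return lgb.train(params, train_set, num_boost_round=10)


# Example values only; the NVIDIA platform is not guaranteed to be index 1.
params = gpu_params(1, 0, objective="binary")
print(params)
```

Trying each platform index in turn with a tiny dataset is one way to find out which combination, if any, trains without crashing.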

@pipidog

pipidog commented Mar 29, 2019

Any solution for this issue?

I compiled LightGBM (using VS 2017) on my Windows 10 machine with a 1060 6GB GPU. It runs well on CPU, but I always get this error message:

OSError: exception: access violation reading 0x0000000000000020

when using the GPU. I checked all the discussions regarding this issue but found no useful information so far. Any ideas?

guolinke added the bug label Aug 1, 2019
@Crazy-LittleBoy

I have a similar problem:
OSError: exception: access violation reading 0x000001E028ADE000

@chenjunboBUPT

When I use it with the basic env, it works well.
But when I use it with the pytorch env, I encounter this error:
OSError: exception: access violation reading 0x000001E028ADE000
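
Since the same setup works in one environment and fails in another, it may help to compare what each environment actually contains. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8 or newer). The helper names and the package list are assumptions chosen for illustration.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_snapshot(packages=("lightgbm", "numpy", "scipy", "scikit-learn", "torch")):
    """Collect interpreter details and package versions for side-by-side comparison."""
    snapshot = {
        "python": platform.python_version(),
        "executable": sys.executable,
    }
    for name in packages:
        snapshot[name] = package_version(name)
    return snapshot


for key, value in environment_snapshot().items():
    print("{}: {}".format(key, value))
```

Running this in both environments and comparing the output line by line would show whether the LightGBM build or one of its dependencies differs between them.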

StrikerRUS mentioned this issue May 11, 2020
@guolinke
Collaborator

guolinke commented Aug 6, 2020

It seems the problem mainly happens on Windows, and one comment says that disabling the Intel GPU can help:

I had the same issue as you before. After I disabled my other GPU (Intel), it now works fine. Maybe you could try it.

You can try this solution. Gentle ping @huanzhang12 for a better workaround.

We have a new CUDA implementation (#3160), which does not depend on OpenCL, and it should fix this.

@Numan100

Any solution for this issue?

I compiled LightGBM (using VS 2017) on my Windows 10 machine with a 1060 6GB GPU. It runs well on CPU, but I always get this error message:

OSError: exception: access violation reading 0x0000000000000020

when using the GPU. I checked all the discussions regarding this issue but found no useful information so far. Any ideas?

I get a similar OSError, reading 0x0000000000000038.
Have you found any solution to this bug? @pipidog @guolinke
