Add support for CUDA-based GPU build #3160

Merged: 124 commits merged into microsoft:master on Sep 20, 2020

Conversation

@ChipKerchner (Contributor) commented Jun 11, 2020

This is the initial CUDA work. It should work similarly to the GPU/OCL work.

To compile, build with 'USE_CUDA=1'. Python unit tests should include 'device': 'cuda' where needed.

All unit tests pass for CPU, GPU/OCL, and CUDA. CPU and CUDA were tested on ppc64le, and GPU/OCL was tested on x86_64.
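
For anyone who wants to try this out, below is a minimal, illustrative Python sketch of how the new device would be selected, assuming LightGBM was built with USE_CUDA=1. The dataset is synthetic and only for demonstration; it is not part of this PR.

# Minimal usage sketch (assumes a CUDA-enabled build, i.e. compiled with USE_CUDA=1)
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 10)            # synthetic features, illustration only
y = np.random.randint(0, 2, size=1000)  # synthetic binary labels

train_data = lgb.Dataset(X, label=y)

params = {
    'objective': 'binary',
    'device': 'cuda',  # select the CUDA tree learner added by this PR
    'num_leaves': 31,
}

booster = lgb.train(params, train_data, num_boost_round=10)
print(booster.predict(X[:5]))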

@ghost commented Jun 11, 2020

CLA assistant check
All CLA requirements met.

@jameslamb (Collaborator) commented:

@ChipKerchner Thanks for your contribution! Before we review, can you please remove the unrelated commits from this PR? I'm not sure what is causing them.

I think that you will have to git rebase master and fix conflicts.

@ChipKerchner (Contributor, Author) commented Jun 11, 2020

Conflicts: src/c_api.cpp and src/boosting/gbdt.cpp should be taken from the cuda branch. Not sure what is going on with R-package/src/lightgbm_R.cpp (a recent commit?).

How do I resolve these since they aren't showing up on my branch?

@ChipKerchner (Contributor, Author) commented Jun 11, 2020

> @ChipKerchner Thanks for your contribution! Before we review, can you please remove the unrelated commits from this PR? I'm not sure what is causing them.
>
> I think that you will have to git rebase master and fix conflicts.

@jameslamb I had to rebase from one GitHub repository to another, and this is the result. I'm not a git expert, so I'll try, but I may ask for help.

This is what I get:

git rebase master
Current branch cuda is up to date.

@jameslamb (Collaborator) commented:

> @ChipKerchner Thanks for your contribution! Before we review, can you please remove the unrelated commits from this PR? I'm not sure what is causing them.
> I think that you will have to git rebase master and fix conflicts.

> @jameslamb I had to rebase from one GitHub repository to another, and this is the result. I'm not a git expert, so I'll try, but I may ask for help.
>
> This is what I get:
>
> git rebase master
> Current branch cuda is up to date.

Your local master is probably very out of date with LightGBM's master. You can run something like this in your repo:

# get current state of LightGBM master branch
git remote add upstream git@github.com:microsoft/LightGBM.git
git checkout master
git pull upstream master

# update the master branch on your fork
git push origin master

# update this feature branch
git checkout cuda
git rebase master

# push your changes
git push origin cuda --force

If you've never done this before, be sure to back up your code somewhere. git push --force is a non-reversible destructive action.

@ChipKerchner (Contributor, Author) commented Jun 11, 2020

I did as you suggested. Hopefully everything is fine now.

This branch is even with microsoft:master.

@jameslamb (Collaborator) commented:

> I did as you suggested. Hopefully everything is fine now.
>
> This branch is even with microsoft:master.

Great! It looks like only the changes for your PR are in the history now, thank you.

@jameslamb (Collaborator) left a comment:

Thanks for cleaning up the git stuff! I've left a few additional minor requests here.

Can you please get continuous integration working? Right now many of the CI jobs are failing, which might mean that this pull request in its current state has breaking changes.

Some time soon you will hear from the C++ maintainers with questions about the approach and the problem this pull request solves.

Review comments (now resolved) were left on: CMakeLists.txt, build_LGBM.232.sh, install_LGBM.232.sh, and src/boosting/gbdt.cpp.

@StrikerRUS (Collaborator) commented Jun 11, 2020

I believe that the first thing should be getting @huanzhang12's approval of the concept as the main expert in GPU computing. Otherwise, many further actions and reviews can be pointless.

@ChipKerchner (Contributor, Author) commented:

> I believe that the first thing should be getting @huanzhang12's approval of the concept as the main expert in GPU computing. Otherwise, many further actions and reviews can be pointless.

Already started a conversation a while back:

#2937

@StrikerRUS (Collaborator) commented:

> Already started a conversation a while back.

Ah, I see, OK! But those were just words, and here is the code.

@ChipKerchner (Contributor, Author) commented Jun 11, 2020

Could someone tell me why the automated checks are failing and what I should do to fix them? I have a minimal x86_64 (AMD64) CUDA testing environment (MINGW64) with MSVC Debug 2019 (14.2). Main development was performed on ppc64le.

@ChipKerchner (Contributor, Author) commented Jun 15, 2020

> Can you please get continuous integration working? Right now many of the CI jobs are failing, which might mean that this pull request in its current state has breaking changes.

@jameslamb I looked at the logs and such. I'm not sure what is failing, how I can reproduce it on my system(s), or how to fix it. Help would be much appreciated.

Update: I fixed a problem with accidentally overriding gpu_use_dp for the GPU version.

These are the problems I don't quite understand yet. If anyone can help, please let me know.

  1. It looks like there are some errors during setup.
  2. The regular version is fine; bdist and sdist are not.
  3. There seems to be a crash in test_consistency.py which I cannot reproduce. Maybe others?
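
For context on the gpu_use_dp mention above: gpu_use_dp is LightGBM's existing parameter for using double-precision math in GPU training, and it is set alongside device in the parameter dict. A minimal, purely illustrative sketch (the values are examples, not part of this PR):

# Illustrative only: the existing OpenCL GPU device with double precision enabled,
# next to the new CUDA device this PR adds.
ocl_params = {'objective': 'binary', 'device': 'gpu', 'gpu_use_dp': True}
cuda_params = {'objective': 'binary', 'device': 'cuda'}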

@jameslamb (Collaborator) commented:

> Can you please get continuous integration working? Right now many of the CI jobs are failing, which might mean that this pull request in its current state has breaking changes.

> @jameslamb I looked at the logs and such. I'm not sure what is failing, how I can reproduce it on my system(s), or how to fix it. Help would be much appreciated.
>
> Update: I fixed a problem with accidentally overriding gpu_use_dp for the GPU version.
>
> These are the problems I don't quite understand yet. If anyone can help, please let me know.
>
>   1. It looks like there are some errors during setup.
>   2. The regular version is fine; bdist and sdist are not.
>   3. There seems to be a crash in test_consistency.py which I cannot reproduce. Maybe others?

I looked through the failing jobs and other than the failing lint task, I'm not sure how to help with the other issues you're seeing. I'm not a primary C++ maintainer for this project, and you're touching a lot of files that I am not familiar with, so I'm not the best person to help you fix these issues.

I can say this: I was surprised to see how many changes are being added that are not inside #ifdef USE_CUDA blocks. Those are likely the source of any failures in existing CI jobs.

Building on what @StrikerRUS said in #3160 (comment), I will reserve more comments until @huanzhang12 can give a thorough review.

@ChipKerchner (Contributor, Author) commented:

@StrikerRUS Since you were recently working on parts of the continuous integration setup, and I have no way to duplicate it, could you help me and tell me why various parts are failing or crashing?

@StrikerRUS (Collaborator) commented:

@ChipKerchner I see that the failing tests are from this file: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_consistency.py. Since the error is a segfault and it doesn't depend on the OS, I believe that the root cause is something fundamental in the C++ code.

To reproduce the issue, just run pytest ./tests/python_package_test/test_consistency.py. If you need an exact environment, you can use our Ubuntu Docker image: https://hub.docker.com/r/lightgbm/vsts-agent.

@ChipKerchner (Contributor, Author) commented:

> To reproduce the issue, just run pytest ./tests/python_package_test/test_consistency.py. If you need an exact environment, you can use our Ubuntu Docker image: https://hub.docker.com/r/lightgbm/vsts-agent.

I am able to do this:

docker run -it -e VSTS_ACCOUNT=<my account> -e VSTS_TOKEN=<my token> lightgbm/vsts-agent:ubuntu-14.04-dev

Azure starts up and:

>> Connect:
Connecting to server ...
>> Register Agent:
Scanning for tool capabilities.
Connecting to the server.
Successfully added the agent
Testing agent connection.
2020-06-22 20:12:44Z: Settings Saved.
Scanning for tool capabilities.
Connecting to the server.
2020-06-22 20:12:46Z: Listening for Jobs

I'm not sure what to do next. Is there anyone who can help me? @StrikerRUS @guolinke

@StrikerRUS (Collaborator) left a comment:

@ChipKerchner Thanks for enabling multi-GPU training! Hope it indeed works! However, I haven't gotten your input on the speedup factor gained from multiple GPUs (#3160 (comment)).

Please consider addressing some more minor comments related to code quality, and I guess this PR can finally be merged if no one else is planning to provide a review.

Review comments (now resolved) were left on: CMakeLists.txt, include/LightGBM/cuda/cuda_utils.h, include/LightGBM/config.h, include/LightGBM/cuda/vector_cudahost.h, and src/treelearner/kernels/histogram_16_64_256.cu.

@jameslamb (Collaborator) left a comment:

I have nothing else to recommend (just the question about copyrights, apologies if that has already been addressed).

I don't know enough to thoroughly review most of the code in this PR, but since all the R CI jobs are passing I'm satisfied that it doesn't impact the R package 😀

Thanks for all the hard work @ChipKerchner !

A review comment (now resolved) was left on include/LightGBM/cuda/cuda_utils.h.

@StrikerRUS (Collaborator) left a comment:

@ChipKerchner Thank you, and everyone who worked on this PR, for all your hard work and patience! We really appreciate this valuable contribution!
I don't have any other comments.

@jameslamb Please update your requested changes review status.

@guolinke (Collaborator) commented:

I am going to merge this. Thank you so much! @ChipKerchner @austinpagan

@StrikerRUS (Collaborator) commented:

@ChipKerchner Hello! Could you please provide a GitHub nickname of a person we can contact regarding CUDA bugs?

@ChipKerchner (Contributor, Author) commented Nov 9, 2020

> @ChipKerchner Hello! Could you please provide a GitHub nickname of a person we can contact regarding CUDA bugs?

@StrikerRUS Could you give a little more detail about the nature of the bugs, which files they may be in, and how to reproduce them?

@StrikerRUS (Collaborator) commented:

@ChipKerchner Thanks a lot for the quick reply!

The error is

lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414

and happens somewhere around here:

CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));

This bug can be reproduced both on a multi-GPU configuration (see #3450 (comment)) and on a single-GPU machine (see #3424 (comment)). The easiest way to reproduce it is to run simple_example.py inside NVIDIA Docker:

export ROOT_DOCKER_FOLDER=/LightGBM
cat > docker.env <<EOF
TASK=cuda
COMPILER=gcc
GITHUB_ACTIONS=true
OS_NAME=linux
BUILD_DIRECTORY=$ROOT_DOCKER_FOLDER
CONDA_ENV=test-env
PYTHON_VERSION=3.8
EOF
cat > docker-script.sh <<EOF
export CONDA=\$HOME/miniconda
export PATH=\$CONDA/bin:\$PATH
nvidia-smi
$ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
$ROOT_DOCKER_FOLDER/.ci/test.sh
source activate \$CONDA_ENV
cd \$BUILD_DIRECTORY/examples/python-guide/
python simple_example.py
EOF
sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh

Simply adjust $GITHUB_WORKSPACE in the script above to your working directory containing the LightGBM repo.

If you do not mind, let's move further discussion to #3450, because this PR already has more than 400 comments.

@StrikerRUS (Collaborator) commented:

Gently ping @ChipKerchner.

@ChipKerchner (Contributor, Author) commented:

@StrikerRUS Sorry, we won't be able to look at this until at least next month.

@StrikerRUS (Collaborator) commented:

@ChipKerchner OK, got it! Thanks for your response!

@github-actions (bot) commented:

This pull request has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues and include a reference to this one.

The github-actions bot locked this pull request as resolved and limited the conversation to collaborators on Aug 24, 2023.