
GPU support on ppc64le #2937

Closed
everton1984 opened this issue Mar 23, 2020 · 19 comments

@everton1984 commented Mar 23, 2020

Currently there is no way to run LightGBM with GPU acceleration on ppc64le, since its GPU support is OpenCL-only. What would be the correct approach for adding CUDA code alongside the OpenCL code in LightGBM? If I understand correctly, adding CUDA kernels to the tree learner would be enough, right?
There are other people interested in this; see #1905 and #1129, for example.

@ceseo commented Mar 23, 2020

cc @ceseo

@ceseo commented Mar 30, 2020

Our plan is to contribute the CUDA kernels (histogram16/64/256) to the tree learner and create a common API for OpenCL and CUDA (a common template class), minimizing the changes needed in the current code base.

At first, CUDA support would be restricted to ppc64le/Linux only.

Does that sound reasonable?
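
For illustration, a minimal sketch of what such a common template API could look like (the names OpenCLBackend, CUDABackend, and HistogramKernels are hypothetical, not taken from any actual patch):

#include <iostream>

/* Hypothetical backend tags; the real patch may organise this differently. */
struct OpenCLBackend {};
struct CUDABackend {};

/* A common, backend-templated entry point for the histogram kernels
   (histogram16/64/256); only the kernel launch differs per backend. */
template <typename Backend>
struct HistogramKernels;

template <>
struct HistogramKernels<OpenCLBackend> {
    static void Run(int bin_size) {
        std::cout << "enqueue OpenCL histogram" << bin_size << " kernel\n";
    }
};

template <>
struct HistogramKernels<CUDABackend> {
    static void Run(int bin_size) {
        std::cout << "launch CUDA histogram" << bin_size << " kernel\n";
    }
};

int main() {
    HistogramKernels<OpenCLBackend>::Run(256);
    HistogramKernels<CUDABackend>::Run(256);
    return 0;
}

Callers would only ever name HistogramKernels&lt;Backend&gt;, so the backend choice stays a compile-time decision.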

@guolinke (Collaborator) commented Apr 1, 2020

Thanks @ceseo and @everton1984, this contribution is very welcome!
Pinging @huanzhang12 for the detailed discussion.

@everton1984 (Author)

@huanzhang12 As @ceseo mentioned, we want to add a new set of histogram kernels written in CUDA, use the current GPUTreeLearner as a base class, and add specialisations containing the non-common OpenCL and CUDA code. I think this would require minimal changes to code outside the GPU tree learner. Would that be acceptable? Are you aware of any GPU code outside that part that we might have missed?

@huanzhang12 (Contributor)

@everton1984 It will be a great contribution if you can add equivalent CUDA kernels to LightGBM! Yes, you can inherit from the current GPUTreeLearner as the base class, but many functions in the GPU tree learner currently involve OpenCL memory management (e.g., enqueue_write_buffer, enqueue_unmap_buffer, etc.), so most of the methods cannot be reused directly. I suggest adding an abstraction layer for memory management (e.g., a wrapper that handles memory-related function calls and forwards them to the OpenCL/CUDA backend accordingly), so that many parts of the existing GPUTreeLearner can be reused.
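
A rough sketch of such an abstraction layer (the class and method names here are hypothetical; the real OpenCL/CUDA calls are only indicated in comments):

#include <cstddef>
#include <iostream>

/* Hypothetical memory-management wrapper: the tree learner codes against
   this interface, and each backend forwards to its own API. */
class DeviceMemory {
public:
    virtual ~DeviceMemory() = default;
    virtual void CopyToDevice(const void* host, std::size_t bytes) = 0;
    virtual void CopyFromDevice(void* host, std::size_t bytes) = 0;
};

class OpenCLMemory : public DeviceMemory {
public:
    /* would forward to queue.enqueue_write_buffer(...) */
    void CopyToDevice(const void*, std::size_t bytes) override {
        std::cout << "OpenCL write, " << bytes << " bytes\n";
    }
    /* would forward to queue.enqueue_read_buffer(...) */
    void CopyFromDevice(void*, std::size_t bytes) override {
        std::cout << "OpenCL read, " << bytes << " bytes\n";
    }
};

class CUDAMemory : public DeviceMemory {
public:
    /* would forward to cudaMemcpy(..., cudaMemcpyHostToDevice) */
    void CopyToDevice(const void*, std::size_t bytes) override {
        std::cout << "CUDA H2D copy, " << bytes << " bytes\n";
    }
    /* would forward to cudaMemcpy(..., cudaMemcpyDeviceToHost) */
    void CopyFromDevice(void*, std::size_t bytes) override {
        std::cout << "CUDA D2H copy, " << bytes << " bytes\n";
    }
};

int main() {
    OpenCLMemory ocl;
    CUDAMemory cuda;
    ocl.CopyToDevice(nullptr, 1024);
    cuda.CopyToDevice(nullptr, 1024);
    return 0;
}

With something like this in place, the shared parts of the tree learner would only ever see DeviceMemory.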

@everton1984 (Author)

@huanzhang12 My idea was to implement something like this:

#include <iostream>
#include <memory>

class GPUTreeLearner
{
public:
    GPUTreeLearner() = default;

    /* virtual destructor, needed when deleting through a base pointer */
    virtual ~GPUTreeLearner() = default;

    /* backend-specific code, overridden by each subclass */
    virtual void doStuff() {
    }

    /* common GPU code */
    void A() {
        std::cout << "GPU common code\n";
    }
};

class OCLTreeLearner : public GPUTreeLearner
{
public:
    /* OpenCL-specific code */
    void doStuff() override {
        std::cout << "OCL only code\n";
    }
};

class CUDATreeLearner : public GPUTreeLearner
{
public:
    /* CUDA-specific code */
    void doStuff() override {
        std::cout << "CUDA only code\n";
    }
};

class GPUTreeLearnerFactory
{
public:
    static std::unique_ptr<GPUTreeLearner> GetGPU(bool is_cuda)
    {
        if (is_cuda) {
            return std::make_unique<CUDATreeLearner>();  /* requires C++14 */
        }
        return std::make_unique<OCLTreeLearner>();
    }
};

int main() {
    std::unique_ptr<GPUTreeLearner> pcl = GPUTreeLearnerFactory::GetGPU(false);  /* OpenCL */
    std::unique_ptr<GPUTreeLearner> pcu = GPUTreeLearnerFactory::GetGPU(true);   /* CUDA */
    pcl->A();
    pcl->doStuff();
    pcu->A();
    pcu->doStuff();

    return 0;
}

so GPUTreeLearner would keep the common code. I think this approach is easily maintainable and changes very little code outside GPUTreeLearner.

@ChipKerchner (Contributor) commented Apr 2, 2020

One current limitation of our CUDA code is that it cannot coexist (i.e., be compiled at the same time) with the GPU/OCL code, similar to the constraint that the CPU and GPU/OCL code paths cannot coexist. Do you see a need for both simultaneously in the same compilation? If so, hopefully we can overcome this.
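
To illustrate the constraint (the USE_CUDA/USE_GPU macro names here are placeholders for whatever build flags end up being used), the backend is fixed at compile time, so only one backend's code exists in a given binary:

#include <iostream>

/* Illustrative compile-time backend selection; compile with -DUSE_CUDA or
   -DUSE_GPU to pick a backend, so OpenCL and CUDA code never coexist. */
#if defined(USE_CUDA)
struct DeviceTreeLearner { void Train() { std::cout << "CUDA backend\n"; } };
#elif defined(USE_GPU)
struct DeviceTreeLearner { void Train() { std::cout << "OpenCL backend\n"; } };
#else
struct DeviceTreeLearner { void Train() { std::cout << "CPU backend\n"; } };
#endif

int main() {
    DeviceTreeLearner().Train();
    return 0;
}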

We also assume that parity between GPU/OCL and CUDA is a given. To us that means single-precision 16-, 64-, and 256-bin histogram support. Our initial CUDA version will "probably" also have double-precision histograms, so it would be nice for someone with GPU/OCL expertise to add that functionality to the project.

@huanzhang12 (Contributor)

@everton1984 The implementation skeleton looks great! I suggested an abstraction layer for memory management because it may help reduce code duplication, but you can determine the best approach.

@ChipKerchner The OpenCL implementation does support double precision; it can be enabled with the flag gpu_use_dp=true. Double precision can be very slow on NVIDIA GPUs, though, unless you are using the very expensive "datacenter" GPUs (P100, V100, etc.).
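
For reference, with the LightGBM CLI that flag sits alongside the device selection in the training config, e.g. (a minimal excerpt; other required parameters omitted):

# excerpt from a LightGBM CLI config such as train.conf
task = train
device = gpu
gpu_use_dp = true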

@ceseo commented Apr 6, 2020

@huanzhang12 The original motivation for this port was exactly that: running on Tesla V100s on POWER9 systems. Do you have any objections to starting the port with double-precision support only? Most ppc64le systems support only server-grade GPUs anyway.

@bordaw commented Apr 8, 2020

@huanzhang12 We have also encountered the need for histograms with bin size >256 and plan to support histograms with larger bins. I think the OpenCL implementation would benefit from this capability as well.

@everton1984 (Author)

> @huanzhang12 We have also encountered the need for histograms with bin size >256 and plan to support histograms with larger bins. I think the OpenCL implementation would benefit from this capability as well.

@bordaw This is not relevant to the current issue; we're discussing enabling GPU support on ppc64le. Please create another issue to discuss new features.

@huanzhang12 (Contributor) commented Apr 8, 2020

@ceseo @everton1984 Don't worry about single vs. double precision too much at this time: as long as we have the double-precision version working, it is not difficult to add single-precision support. If you look at the current GPU code, there are only a few places with different code paths for single and double precision, and the differences between them are not that big. But it is better to keep single precision in mind when designing the code workflow, to ease further extensions.
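
One way to keep that extension cheap (purely illustrative; the actual code may organise its types differently) is to parameterise the histogram storage on the accumulator type, so that switching precision later is a one-line change:

#include <cstddef>
#include <vector>

/* Illustrative: histogram buffers templated on the accumulator type. */
template <typename AccT>
struct HistogramBins {
    std::vector<AccT> gradients;
    std::vector<AccT> hessians;
    explicit HistogramBins(std::size_t bins)
        : gradients(bins, AccT(0)), hessians(bins, AccT(0)) {}
};

/* Double precision first, as planned for the initial CUDA port;
   flipping this alias to float enables single precision later. */
using HistAcc = double;
using Histogram = HistogramBins<HistAcc>;

int main() {
    Histogram hist(256);
    hist.gradients[0] += HistAcc(0.5);
    return 0;
}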

@bordaw It is possible to support larger bins, but performance can be much worse: when the bin size is at most 256, the histograms fit in the GPU's local memory, which has much higher bandwidth. If you would like to work on this, you are very welcome to open a new issue.
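
As a rough back-of-the-envelope illustration (the 48 KiB figure is a typical per-workgroup local-memory budget, not a measured one): 256 bins with a float gradient/hessian pair per bin is 256 × 8 B = 2 KiB per histogram, which fits comfortably, whereas 4096 bins at double precision would already need 4096 × 16 B = 64 KiB, exceeding it.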

@bordaw commented Apr 8, 2020

@huanzhang12 Yes, I will create another issue. Also, I agree that the code changes required for single/double-precision histograms are not significant.

@StrikerRUS mentioned this issue May 11, 2020
@ChipKerchner (Contributor)

We are almost ready to push the initial candidate for the CUDA branch.

@ChipKerchner (Contributor) commented Jul 1, 2020

The pull request has been open for almost 3 weeks and we're not getting much feedback.

I'd also like some assistance with the continuous-integration systems. They report problems, but my testing on an AMD64 system with a GPU (OCL) shows no issues with the unit tests, and I am unable to set up an environment that lets me run the automated testing myself. Please help!
@huanzhang12 @guolinke @StrikerRUS

@guolinke (Collaborator) commented Jul 2, 2020

Sorry @ChipKerchner, I will look at this today.

@ChipKerchner (Contributor) commented Aug 26, 2020

@huanzhang12 Our resources are starting to be moved off this project. We would like to get this feature approved and merged into master as soon as possible. Please let us know of any outstanding issues.

@StrikerRUS (Collaborator)

Implemented in #3160. Many thanks to everyone involved for the amazing work! We are looking forward to further CUDA improvements, e.g. Windows support.

@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked this issue as resolved and limited the conversation to collaborators on Aug 23, 2023