support dataset rows more than max(int32_t) (add CMake option) #5540

junpeng0715 · 2022-10-13T07:10:52Z

Related to #5454. Adds a CMake option to switch between int32 and int64.

guolinke · 2022-10-17T02:16:02Z

@shiyu1994 can you help to review this PR?

junpeng0715 · 2022-10-18T01:16:03Z

@guolinke @shiyu1994
Hi!
Based on #5454, a CMake option (USE_DATASET_INT64) has been added to control whether int32 or int64 is used.
The c_api parts are fixed to int64, and values are converted to data_size_t during internal processing.
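
To illustrate the approach described above, here is a minimal C++ sketch. It is not the PR's actual code. It assumes the CMake option ends up as a compile definition also named `USE_DATASET_INT64` (the option name is from this comment; the macro mapping is an assumption), and `ToDataSize` is a hypothetical helper showing how a 64-bit row count arriving at the C API boundary could be narrowed to `data_size_t` with a range check.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Assumption: the build defines USE_DATASET_INT64 when the CMake option is ON.
// The real macro name used by the PR may differ.
#ifdef USE_DATASET_INT64
using data_size_t = std::int64_t;
#else
using data_size_t = std::int32_t;
#endif

// Hypothetical helper (not part of LightGBM): convert a 64-bit row count
// received at the API boundary into data_size_t, throwing if it is negative
// or too large for the configured type instead of silently wrapping.
inline data_size_t ToDataSize(std::int64_t num_rows) {
  const std::int64_t max_rows =
      static_cast<std::int64_t>(std::numeric_limits<data_size_t>::max());
  if (num_rows < 0 || num_rows > max_rows) {
    throw std::out_of_range("row count " + std::to_string(num_rows) +
                            " does not fit in data_size_t");
  }
  return static_cast<data_size_t>(num_rows);
}
```

In a default (int32) build, `ToDataSize(3000000000LL)` would throw under this sketch; in a build with the option enabled, the same value would pass through unchanged.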

Since the current SWIG does not seem to support int64 very well, some temporary fixes have been made in CMakeLists.txt.
I don't think the current modification affects users who stay on int32.

Since we also use SynapseML, we also changed the LightGBM part of SynapseML to the Long type.
We'll submit to SynapseML later.

For testing, we have trained on datasets with more rows than the int32 limit in our own project (SynapseML + LGBM), and I think training works normally with int64.
Instead of adding that test to LGBM, this PR only adds an int64 build job to the LGBM pipeline.

We sincerely hope that you will review it.

Best Regards

guolinke · 2022-10-18T01:42:59Z

@junpeng0715 can you also add a CI test, to build with int64_t type and run some tests?

guolinke · 2022-10-18T01:43:41Z

> @junpeng0715 can you also add a CI test, to build with int64_t type and run some tests?

Oh, I saw the test, thank you!

jameslamb

I'd like a chance to also review these quite impactful changes. Before I do...why was a new PR opened instead of just pushing changes to #5454? Can we now close #5454?

Also, please check your .gitconfig...it looks like many of your commits are not tied to your GitHub account. See #5532 (comment).

junpeng0715 · 2022-10-18T02:20:02Z

@microsoft-github-policy-service agree company="ACN"

junpeng0715 · 2022-10-18T02:43:52Z

> I'd like a chance to also review these quite impactful changes. Before I do...why was a new PR opened instead of just pushing changes to #5454? Can we now close #5454?

@jameslamb #5454 is the version without the CMake option, kept as a backup. I think we can close it.

> Also, please check your .gitconfig...it looks like many of your commits are not tied to your GitHub account. See #5532 (comment).

Thank you for the tip, I've fixed it!

jameslamb · 2022-10-18T03:09:47Z

Ok thanks, I closed #5454.

In the future, when reviewers ask for a different approach, please just push new commits in response to their comments instead of opening new PRs. That keeps all of the conversation focused in one place, and ensures that other conversation threads that linked to your work don't now need to link to multiple very similar pull requests.

junpeng0715 · 2022-10-18T06:28:08Z

@guolinke @jameslamb @shiyu1994
Hi guys,
How can I trigger the pipelines for the R-package, CUDA, etc.? I think there were some errors.

junpeng0715 · 2022-10-18T10:50:00Z

Hi @guolinke @shiyu1994
Can you give some suggestions on how to fix the CUDA and R-package (windows-latest) issues?

junpeng0715 · 2022-10-27T11:19:57Z

Could you please help review this PR?

We will review this once it is passing CI. Please fix the failing lint task (https://github.com/microsoft/LightGBM/actions/runs/3327507460/jobs/5510896951) and Azure DevOps jobs (https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=13781&view=results).

CUDA builds have also been blocked in this repo for 9 days now...we are blocked waiting on #5546. Sorry for the inconvenience.

I will check the lint task. I think the failing DevOps jobs related to macOS are being worked on in #5555, and the CUDA issue is #5546, as you said.
There are also some R-package failures in this PR and in other PRs that are unrelated to my code.

junpeng0715 · 2022-10-31T09:03:35Z

Hi @jameslamb
The Static Analysis issue for this PR has been resolved.

The blocking issues are as follows, am I right?
R-package: #5563
macOS: #5555
CUDA: #5546

jameslamb · 2022-10-31T13:09:48Z

> The blocking issues are as follows, am I right?

No. #5555 is not a blocking issue.

The following need to be fixed before any pull requests in this repo will build successfully.

CUDA jobs: [ci] CUDA CI jobs are broken: "driver/library version mismatch" #5546
support for OpenMP 15+ on macOS: CMake can't find libomp on Apple M1 #5549
MikTex setup on Windows: [ci] [R-package] windows-latest CI jobs are broken: "initexmf.exe" failing #5562

Fixing #5546 requires manual action that only @shiyu1994 can do. #5549 and #5562 will be fixed by #5563.

junpeng0715 · 2022-10-31T13:30:26Z

> > The blocking issues are as follows, am I right?
>
> No. #5555 is not a blocking issue.
>
> The following need to be fixed before any pull requests in this repo will build successfully.
>
> CUDA jobs: [ci] CUDA CI jobs are broken: "driver/library version mismatch" #5546
> support for OpenMP 15+ on macOS: CMake can't find libomp on Apple M1 #5549
> MikTex setup on Windows: [ci] [R-package] windows-latest CI jobs are broken: "initexmf.exe" failing #5562
>
> Fixing #5546 requires manual action that only @shiyu1994 can do. #5549 and #5562 will be fixed by #5563.

Thank you @jameslamb ! I understand clearly now!

jameslamb · 2022-11-04T01:18:21Z

@junpeng0715 The CI issues have been fixed. Please merge latest master into this branch so we can review.

junpeng0715 · 2022-11-04T08:44:28Z

> @junpeng0715 The CI issues have been fixed. Please merge latest master into this branch so we can review.

@jameslamb I merged the latest master and got a new error in the Linux inference job (Initialize job): "File not found: 'docker'"

junpeng0715 · 2022-11-08T00:58:54Z

@jameslamb
I resolved the conflicts. Could you help trigger the pipeline?

jameslamb · 2022-11-08T02:58:15Z

> could you help trigger pipeline

Sure, re-triggered. As I mentioned previously (#5540 (comment)), you could also consider making a separate, small, non-controversial contribution (like one of the items from #5313) so that you wouldn't be considered a first-time contributor any more and wouldn't need to @ us every time you want the CI jobs to run.

junpeng0715 · 2022-11-08T16:01:34Z

Hi @jameslamb
I got an R-package error. I think it is also an OpenMP issue:

test-omp.c:1:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
1 error generated.
*** OpenMP not supported! data.table uses OpenMP to automatically

junpeng0715 · 2022-11-10T08:30:19Z

Hi @jameslamb and @guolinke,
I will no longer be working on this task. Thank you for your long-term support. I may continue to contribute to LightGBM through my personal account.
Could you discuss the schedule for this PR? microsoft/SynapseML#1684 is waiting on the plan.

@sinhrks will handle the PR if further modifications are required. He belongs to the same company and has agreed to the CLA.

Best Regards

guolinke · 2022-11-10T08:36:05Z

@junpeng0715
Sorry to hear that, thank you so much for the great work!
And looking forward to your contributions in the future!

We will move this PR forward, and discuss at microsoft/SynapseML#1684

jameslamb · 2023-02-25T22:31:32Z

> We will move this PR forward,

@guolinke are you still interested in pursuing this PR? It's been 4 months since the most recent commit.

guolinke · 2023-02-26T02:22:49Z

I think we can work on the 4.0.0 release first, then revisit these PRs after that. cc @shiyu1994 .

sinhrks · 2023-02-26T03:08:22Z

Thx for following up. Is it ok for me to prepare another PR to rebase this?

guolinke · 2023-02-26T14:33:14Z

> Thx for following up. Is it ok for me to prepare another PR to rebase this?

Sure, please go ahead.

PankajMerisha · 2023-07-18T13:52:15Z

Hi team,
We are unable to train our models due to this issue. Please refer to microsoft/SynapseML#2014.
If there is any workaround, please let me know.

junpeng.li and others added 3 commits October 11, 2022 10:48

[microsoft#5454]support dataset rows more then max(int32_t)

087793d

Fix cpp test code microsoft#5454

b32427a

modify GPU data type to fit int64

6eac042

junpeng0715 requested review from guolinke, shiyu1994, StrikerRUS, jameslamb and jmoralez as code owners October 13, 2022 07:10

junpeng0715 changed the title ~~[WIP] support dataset rows more then max(int32_t)~~ [WIP] support dataset rows more then max(int32_t) (add CMAKE option) Oct 13, 2022

junpeng0715 mentioned this pull request Oct 13, 2022

[WIP] support dataset rows more then max(int32_t) #5454

Closed

jameslamb requested changes Oct 18, 2022

jameslamb added the feature label Oct 18, 2022

junpeng0715 force-pushed the dev_1011 branch from 9ccfc2c to b6b5d63 Compare October 18, 2022 02:40

junpeng0715 added a commit to junpeng0715/LightGBM that referenced this pull request Oct 18, 2022

modify R for microsoft#5540

2ccdb39

junpeng0715 force-pushed the dev_1011 branch from 2ccdb39 to 6eac042 Compare October 18, 2022 04:43

support dataset rows more then max(int32_t) (add CMAKE option)

2c43490

junpeng0715 requested review from jameslamb and removed request for jmoralez, shiyu1994, guolinke and StrikerRUS October 18, 2022 06:25

junpeng0715 added 3 commits October 28, 2022 05:32

support dataset rows more then max(int32_t) (add CMAKE option)

9f298d2

support dataset rows more then max(int32_t) (add CMAKE option)

3402415

support dataset rows more then max(int32_t) (add CMAKE option)

fc1d58e

junpeng0715 changed the title ~~[WIP] support dataset rows more then max(int32_t) (add CMAKE option)~~ support dataset rows more then max(int32_t) (add CMAKE option) Oct 31, 2022

jameslamb added the awaiting review and in progress labels Nov 4, 2022

jameslamb requested review from guolinke and shiyu1994 November 4, 2022 01:17

Merge branch 'master' into dev_1011

daebe44

resolve conflicts

aa9be4c

resolve lint

0ebc4ab

jameslamb added the awaiting response label Feb 25, 2023

jmoralez mentioned this pull request Apr 27, 2023

Dask LightGBM breaks if num_rows * bagging_fraction > int32_t max #5861

Open

PankajMerisha mentioned this pull request Jul 17, 2023

[BUG] Facing issue with LightGBMClassifier training, it stops after iteration 0 microsoft/SynapseML#2014

Closed

19 tasks
