Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] change the destructor of ScoreUpdater to virtual (fixes #5400) #5403

Merged
merged 1 commit into from
Aug 20, 2022

Conversation

shiyu1994
Copy link
Collaborator

This is to fix #5400.

@jameslamb
Copy link
Collaborator

@shiyu1994 please add one of the labels from

to every pull request you open, so it will be correctly categorized in the release notes

@jameslamb jameslamb added the fix label Aug 4, 2022
@jameslamb
Copy link
Collaborator

jameslamb commented Aug 6, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2809833096

Status: failure ❌.

@jameslamb
Copy link
Collaborator

jameslamb commented Aug 6, 2022

also @shiyu1994 I noticed you created this PR from your fork...can you please just create branches here in microsoft/LightGBM in the future?

That makes it easier for other maintainers to help and collaborate, e.g. avoids issues like #5371.

@shiyu1994
Copy link
Collaborator Author

@jameslamb Thanks. Will always create a branch on microsoft/LightGBM in the future.

@shiyu1994
Copy link
Collaborator Author

shiyu1994 commented Aug 10, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2831189390

Status: failure ❌.

@jameslamb
Copy link
Collaborator

Sorry @shiyu1994 , it looks like valgrind job is timing out. I'm not sure why, but it has been a few weeks since we last ran that job and I know there have been some significant changes on master in that time.

I opened #5414 increasing the timeout, hopefully that will help.

@jameslamb
Copy link
Collaborator

jameslamb commented Aug 11, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2840116820

Status: failure ❌.

@jameslamb
Copy link
Collaborator

@shiyu1994 the valgrind test is now failing for what looks like a real reason, not a timeout.

==4762== 352 bytes in 1 blocks are possibly lost in loss record 152 of 2,904
==4762==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==4762==    by 0x40147D9: calloc (rtld-malloc.h:44)
==4762==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==4762==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==4762==    by 0x4DF4834: allocate_stack (allocatestack.c:430)
==4762==    by 0x4DF4834: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==4762==    by 0x575B1EF: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==4762==    by 0x5751A10: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==4762==    by 0x16C99F47: OMP_NUM_THREADS() (openmp_wrapper.h:22)
==4762==    by 0x16FE6E64: OMP_SET_NUM_THREADS(int) (openmp_wrapper.h:29)
==4762==    by 0x16FDB3AC: LGBM_DatasetCreateFromCSC (c_api.cpp:1462)
==4762==    by 0x1700C83C: LGBM_DatasetCreateFromCSC_R (lightgbm_R.cpp:183)
==4762==    by 0x4953D8A: R_doDotCall (dotcode.c:627)
==4762==    by 0x495ED3E: do_dotcall (dotcode.c:1284)
==4762==    by 0x499E245: Rf_eval (eval.c:851)

(build link)

That might not be from this PR (it has been a while since we have run r-valgrind), but it should be fixed. Given that it's coming from LGBM_DatasetCreateFromCSC, I suspect it might be related to #5344, but not sure.

Can you please look into it?

@shiyu1994
Copy link
Collaborator Author

@jameslamb I did not find any potential memory linkage introduced by the changes of LGBM_DatasetCreateFromCSC from #5344. I noticed that there's one false positive case documented about LGBM_DatasetCreateFromCSC

# one error caused by a false positive between valgrind and openmp is allowed
# ==2063== 336 bytes in 1 blocks are possibly lost in loss record 153 of 2,709
# ==2063== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
# ==2063== by 0x40149CA: allocate_dtv (dl-tls.c:286)
# ==2063== by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
# ==2063== by 0x5702322: allocate_stack (allocatestack.c:622)
# ==2063== by 0x5702322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
# ==2063== by 0x56D0DDA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
# ==2063== by 0x56C88E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
# ==2063== by 0x1544D29C: LGBM_DatasetCreateFromCSC (c_api.cpp:1286)
# ==2063== by 0x1546F980: LGBM_DatasetCreateFromCSC_R (lightgbm_R.cpp:91)
# ==2063== by 0x4941E2F: R_doDotCall (dotcode.c:634)
# ==2063== by 0x494CCC6: do_dotcall (dotcode.c:1281)
# ==2063== by 0x499FB01: bcEval (eval.c:7078)
# ==2063== by 0x498B67F: Rf_eval (eval.c:727)
# ==2063== by 0x498E414: R_execClosure (eval.c:1895)
.
Is it related to the new error message? They seem quite similar.

@jameslamb
Copy link
Collaborator

oh wow @shiyu1994 you are right! Very nice attention to detail!!

This looks like EXACTLY that same issue, even the same stack trace. I guess the test is failing because the "possibly lost" went from 336 bytes.

# ==2063== 336 bytes in 1 blocks are possibly lost in loss record 153 of 2,709

to 352 (#5403 (comment))

The fact that it's exactly 16 bytes is interesting. I wonder if maybe, for example, some change from int16 to int32 or something somewhere is causing this? That could be coming from a change in libomp version, compiler version, or something else...I think it's ok just to increase the limit in the test. I'll put up a PR to do that.

@shiyu1994
Copy link
Collaborator Author

shiyu1994 commented Aug 19, 2022

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/2889928554

Status: success ✔️.

@jameslamb
Copy link
Collaborator

🎉 glad valgrind is passing now!

whenever @guolinke is able to review and approve this, let's merge it.

@jameslamb
Copy link
Collaborator

I wanted to check that this fixed the warning from #5400, but unfortunately the build logs are already gone from Azure.

image

So I was going to merge master into this and push a commit, but I can't because this is from your fork.

Closing and re-opening to re-trigger CI.

@jameslamb jameslamb closed this Aug 20, 2022
@jameslamb jameslamb reopened this Aug 20, 2022
@jameslamb jameslamb self-requested a review August 20, 2022 04:25
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the logs for several different CI jobs and I don't see any warnings after:

[ 11%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt_model_text.cpp.o

Thanks for the fix!

@jameslamb jameslamb changed the title [fix] change the destructor of ScoreUpdater to virtual [fix] change the destructor of ScoreUpdater to virtual (fixes #5400) Aug 20, 2022
@jameslamb jameslamb merged commit 865c126 into microsoft:master Aug 20, 2022
@jameslamb jameslamb mentioned this pull request Oct 7, 2022
40 tasks
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
3 participants