[possibly issue] Watchdog detected hard LOCKUP #431

Closed
henry0312 opened this issue Apr 18, 2017 · 19 comments · Fixed by #440

Comments

@henry0312
Contributor

henry0312 commented Apr 18, 2017

After 66b7f03, my system crashes while running a heavy task (training with 20 threads for several hours).
The kernel log says "NMI watchdog: Watchdog detected hard LOCKUP on cpu 10".
I'm not sure whether the commit has a serious bug, but I would appreciate it if you could look into it.

Environment info

Operating System: Ubuntu 16.04.2 (kernel 4.4.0-71-generic and 4.4.0-72-generic)
CPU: Intel Core i7-6950X
C++/Python/R version: Python 3.5.3
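
For anyone trying to confirm the same symptom, here is a minimal Python sketch that scans kernel log text for the watchdog message quoted above and collects the CPU numbers it names. The helper name `find_hard_lockups` and the timestamp in the sample line are made up for illustration, and other kernel versions may word the message differently.

```python
import re

# Matches the message quoted in this report; the wording is assumed
# to be the same on other kernels, which may not hold.
LOCKUP_RE = re.compile(r"NMI watchdog: Watchdog detected hard LOCKUP on cpu (\d+)")


def find_hard_lockups(log_text):
    """Return the CPU numbers named in hard-lockup messages, in log order."""
    return [int(match.group(1)) for match in LOCKUP_RE.finditer(log_text)]


# Illustrative log line; the timestamp prefix is invented.
sample = "[12345.678901] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10\n"
print(find_hard_lockups(sample))
```

The text to scan could come from `dmesg` output or a saved kernel log file, depending on how the system keeps logs across a crash.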

@henry0312 henry0312 changed the title [Perhaps issue] Watchdog detected hard LOCKUP [possibly issue] Watchdog detected hard LOCKUP Apr 18, 2017
@guolinke
Collaborator

guolinke commented Apr 19, 2017

@henry0312
I'm not sure about the reason.
Can you try removing -march=core2 -mtune=native from this line (https://github.com/Microsoft/LightGBM/blob/66b7f03238387951e6a0ba3060353ae098691fbc/CMakeLists.txt#L50)?

@henry0312
Contributor Author

henry0312 commented Apr 19, 2017

Thank you for your response.
I'll try removing -march=core2 -mtune=native from CMakeLists.txt.

@guolinke
Collaborator

@henry0312 any updates?

@henry0312
Contributor Author

Sorry for the late response.
The problem hasn't happened since I removed -march=core2 -mtune=native, although it may still occur after a while.

@guolinke
Collaborator

@henry0312
Sorry, so does it still happen after removing -march=core2 -mtune=native?

@henry0312
Contributor Author

henry0312 commented Apr 21, 2017

Maybe.
The problem hasn't occurred so far.
However, I can't say that it is fixed, because I don't know when it happens 😢

@henry0312
Contributor Author

henry0312 commented Apr 21, 2017

Anyway, I believe it would be good to remove -march=core2 -mtune=native from CMakeLists.txt, because -O3 is often sufficient for optimization.

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@henry0312 It depends. LightGBM v2 benefits (conditionally) a bit from using native; you can check here: #348

The difference is about 1%, which is significant for training runs that last hours or days.

@guolinke I think we can remove -march=core2 -mtune=native for now if they seem to (potentially) cause issues, and I'll make a PR about whether to add them back on the weekend of 6-7 May (I'll add more details about when to use them specifically for LightGBM, as there seem to be conditions in which they increase performance, but they can also decrease it).

@henry0312
Contributor Author

henry0312 commented Apr 21, 2017

We shouldn't apply optimizations that depend on the CPU architecture at compile time.
For example, it is not good for AMD users to have the build optimized for Intel Core 2.
Users should simply choose whatever works best for them.

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@henry0312 If your CPU doesn't have the Intel Core 2 architecture instructions, then your CPU must be extremely old (so old that it is older than Bobcat, an AMD architecture from 2011).

march does not optimize for specific CPUs (the Intel / AMD brand name does not matter at all); it optimizes for a specific set of instructions.

And it is more beneficial for AMD to optimize for Intel Core 2 than not at all (brand name does not matter). If that were untrue, it would be the same as saying AVX2 instructions are useless on Haswell and Excavator (AVX instructions are the most important breakthrough CPU instructions for Data Science, and there's a reason Intel has a separate clock rate on their CPUs dedicated to AVX instructions).

@henry0312
Contributor Author

@Laurae2 Thank you for the explanation. It seems that I misunderstood some of gcc's optimization options.
However, I don't think -march=core2 is the best choice for AMD Ryzen (for instance), so I believe users who need specific optimizations should choose what works best for their situation.

@Laurae2
Contributor

Laurae2 commented Apr 22, 2017

@henry0312 -march=core2 -mtune=native for <your CPU> microarchitecture (like AMD Ryzen) automatically tunes for <your CPU> microarchitecture (like AMD Ryzen) while keeping compatibility with any CPU satisfying the minimal requirements of the Core 2 microarchitecture (which literally means nearly every x86_64 CPU from 2011, i.e. from Core 2 for Intel and Bobcat for AMD).

If you run LightGBM on <your CPU> microarchitecture (like AMD Ryzen), it will be as if you compiled the code for -march=native. If you use the code on an older CPU which does not support the full set of instructions for <your CPU> microarchitecture (like AMD Ryzen), it will conditionally branch parts of the code to fall back on -march=core2 code when your CPU cannot use -mtune=native code.

It is the best way to maintain compatibility (exploiting only MMX + SSE (1, 2, 3, E3) instructions) while getting a bit more performance for any CPU (exploiting ALL the features of your CPU), unless the compiler finds no use for the additional instructions in the code being compiled.
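
To check whether a given machine meets the baseline described above, one option is to compare the CPU flags reported by Linux against the instruction sets listed. This is a minimal Python sketch, assuming an x86 Linux system and assuming "E3" above refers to SSSE3. The helper names `parse_cpu_flags` and `missing_core2_flags` are hypothetical, and the sample text is invented; note that /proc/cpuinfo reports SSE3 as `pni`.

```python
# Baseline flags as spelled in /proc/cpuinfo on x86 Linux.
# "pni" is the name Linux uses for SSE3.
CORE2_FLAGS = {"mmx", "sse", "sse2", "pni", "ssse3"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_core2_flags(cpuinfo_text):
    """Return the baseline flags that the CPU does not report, sorted."""
    return sorted(CORE2_FLAGS - parse_cpu_flags(cpuinfo_text))


# Invented sample: a CPU that lacks SSSE3.
sample = "processor\t: 0\nflags\t\t: fpu mmx sse sse2 pni\n"
print(missing_core2_flags(sample))  # prints ['ssse3']
```

On a real system the input would be the contents of /proc/cpuinfo; an empty result means every flag in the assumed baseline is present.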

@Laurae2
Contributor

Laurae2 commented Apr 24, 2017

I can confirm this is not related to the -march or -mtune arguments: I encountered the same crash myself on #448, using OpenCL with only -O2.

@guolinke
Collaborator

@Laurae2
That's weird; 66b7f03 didn't change much.
Did you run on Ubuntu as well? Was it the same error?

@Laurae2
Contributor

Laurae2 commented Apr 24, 2017

Virtualized Ubuntu 16.04 (bare-metal host = Ubuntu 16.04), same error but a different CPU. It happened only once, and I had to fully wipe my server to regain control of it (it would not boot anymore).

Note: it crashed both the virtual machine and the host at the same time, not only the virtual machine. I noticed that 40 threads with a large maximum number of leaves can, on rare occasions, make the computer awfully slow (so slow it becomes barely responsive; this happens very randomly) on CPU only.

@guolinke
Collaborator

@Laurae2
I think it is some kind of system error. Not sure why it happens.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1596866

@henry0312
Contributor Author

Unfortunately, the same error happened a day ago, so -march=core2 -mtune=native are not related to this issue.
However, the problem hasn't happened since I disabled Intel® Turbo Boost Max Technology 3.0, which is supported from Ubuntu 17.04 (with Linux kernel 4.10).
(cf. https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.10-Scheduler-TBM3)
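
The thread does not say how the feature was disabled (it may well have been a firmware setting). As one way to inspect the scheduler side, here is a minimal Python sketch that reads the `sched_itmt_enabled` sysctl, which kernels from 4.10 onward are assumed to expose on hardware that supports the feature. The helper names `parse_sysctl_bool` and `itmt_enabled` are hypothetical.

```python
from pathlib import Path

# Assumed location of the scheduler's ITMT toggle (kernel 4.10+,
# present only on hardware that supports the feature).
ITMT_SYSCTL = "/proc/sys/kernel/sched_itmt_enabled"


def parse_sysctl_bool(text):
    """Interpret a sysctl value: '1' means enabled, anything else disabled."""
    return text.strip() == "1"


def itmt_enabled(path=ITMT_SYSCTL):
    """Return True or False for the ITMT setting, or None if it cannot be read."""
    try:
        return parse_sysctl_bool(Path(path).read_text())
    except OSError:
        return None


print("ITMT scheduling enabled:", itmt_enabled())
```

A result of `None` only means the file could not be read, for example on an older kernel or on a CPU without the feature; it says nothing about firmware settings.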

@henry0312
Contributor Author

However, in my environment, I confirmed that this problem also happened on Ubuntu 17.04 with Intel Turbo Boost Max Technology 3.0 😢
