show specific error message in TCP accept/send/receive logs #4128
#if defined(_WIN32)
Log::Fatal("Socket accept error (code: %d)", err_code);
#else
Log::Fatal("Socket accept error, %s (code: %d)", std::strerror(err_code), err_code);
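For anyone reading this thread later, here is a minimal, self-contained sketch of what the two branches in the diff produce. `FormatSocketError` and the `sketch` namespace are hypothetical names used only for illustration; LightGBM itself passes the format string straight to `Log::Fatal`, and this sketch uses `std::snprintf` in its place so it compiles on its own.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

namespace sketch {

// Hypothetical helper (not part of LightGBM): builds the same text that the
// diff passes to Log::Fatal, so the two branches can be compared directly.
inline std::string FormatSocketError(const char* operation, int err_code) {
  char buf[512];
#if defined(_WIN32)
  // Windows branch from the diff: integer code only.
  std::snprintf(buf, sizeof(buf), "Socket %s error (code: %d)", operation,
                err_code);
#else
  // Non-Windows branch from the diff: error text plus integer code.
  std::snprintf(buf, sizeof(buf), "Socket %s error, %s (code: %d)", operation,
                std::strerror(err_code), err_code);
#endif
  return std::string(buf);
}

}  // namespace sketch
```

Calling `sketch::FormatSocketError("accept", ECONNRESET)` on a non-Windows system yields the operation name, the text from `std::strerror`, and the numeric code in one string. The exact text and the numeric value of `ECONNRESET` depend on the platform's C library.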
Shall we switch to a thread-safe version of strerror (strerror_r for non-Windows and strerror_s for Windows)?
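A rough sketch of what that suggestion could look like, for reference. `SafeStrError`, `PickMessage`, and the `sketch` namespace are hypothetical names, not existing LightGBM code. One wrinkle is that `strerror_r` comes in two flavours: the XSI version returns an `int` and fills the buffer, while the GNU version returns a `char*` that may not point into the buffer. The overloads below are one way to handle both without feature-test macros; the Windows branch assumes the toolchain provides Microsoft's `strerror_s(buffer, size, errnum)`.

```cpp
#include <string.h>  // strerror_r (POSIX/GNU) or strerror_s (Windows CRT)

#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

namespace sketch {

#if !defined(_WIN32)
// XSI strerror_r returns an int (0 on success) and writes into buf.
inline const char* PickMessage(int rc, const char* buf) {
  return rc == 0 ? buf : "Unknown error";
}
// GNU strerror_r returns a char* that may point at a static string
// instead of buf, so the return value is what must be used.
inline const char* PickMessage(const char* msg, const char* /*buf*/) {
  return msg != nullptr ? msg : "Unknown error";
}
#endif

// Hypothetical helper: thread-safe replacement for std::strerror.
inline std::string SafeStrError(int err_code) {
  char buf[256] = {0};
#if defined(_WIN32)
  // Assumption: the Windows CRT variant of strerror_s is available.
  if (strerror_s(buf, sizeof(buf), err_code) != 0) {
    return "Unknown error";
  }
  return std::string(buf);
#else
  // Overload resolution picks the right PickMessage for whichever
  // strerror_r signature the C library exposes.
  return std::string(PickMessage(strerror_r(err_code, buf, sizeof(buf)), buf));
#endif
}

}  // namespace sketch
```

With this in place, the log call would use `sketch::SafeStrError(err_code).c_str()` instead of `std::strerror(err_code)`. Whether the extra preprocessor branching is worth it is the portability question discussed in the replies.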
@shiyu1994 sorry for the delay, I'm back to this.
Will that cause portability problems?
I saw in https://en.cppreference.com/w/c/string/byte/strerror that those functions will only work when __STDC_LIB_EXT1__ is defined.
As with all bounds-checked functions, strerror_s and strerrorlen_s are only guaranteed to be available if __STDC_LIB_EXT1__ is defined by the implementation and if the user defines __STDC_WANT_LIB_EXT1__ to the integer constant 1 before including <string.h>.
And when I researched a little bit about support for that, I found some posts that suggest that those aren't likely to be supported by default in many setups.
- (Aug 2017) https://stackoverflow.com/a/45578649/3986677
- (Dec 2017) https://stackoverflow.com/questions/47867130/stdc-lib-ext1-availability-in-gcc-and-clang
- (Jun 2018) https://stackoverflow.com/questions/50724726/why-didnt-gcc-or-glibc-implement-s-functions
- (Dec 2020) https://stackoverflow.com/questions/65471315/scanf-s-is-not-included-in-c11
- (Dec 2020) https://stackoverflow.com/a/65471801/3986677
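To make the portability concern concrete, here is a small sketch of the guard that the cppreference text implies. `kHasAnnexK`, `ErrorText`, and the `sketch` namespace are hypothetical names for illustration. The code opts in with `__STDC_WANT_LIB_EXT1__`, checks whether the implementation defines `__STDC_LIB_EXT1__`, and falls back to `std::strerror` otherwise. Going by the posts linked above, I would expect the fallback branch to be the one taken with glibc, but that is an assumption about any given setup and not something this sketch verifies.

```cpp
// Annex K is opt-in: this macro must be defined before <string.h> is
// included for the first time in the translation unit.
#define __STDC_WANT_LIB_EXT1__ 1
#include <string.h>

#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

namespace sketch {

// True only if the C library advertises the bounds-checked (Annex K) functions.
#if defined(__STDC_LIB_EXT1__)
constexpr bool kHasAnnexK = true;
#else
constexpr bool kHasAnnexK = false;
#endif

// Hypothetical helper: uses strerror_s when Annex K is advertised and
// falls back to the non-thread-safe std::strerror otherwise.
inline std::string ErrorText(int err_code) {
#if defined(__STDC_LIB_EXT1__)
  char buf[256] = {0};
  if (strerror_s(buf, sizeof(buf), err_code) == 0) {
    return std::string(buf);
  }
  return "Unknown error";
#else
  return std::string(std::strerror(err_code));
#endif
}

}  // namespace sketch
```

If `sketch::kHasAnnexK` is `false` on the platforms that matter, the guarded branch never compiles in and the behaviour is identical to the current change.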
I see. So maybe keeping the current change is the best option.
Ok, thanks very much for bringing it up! I learned new things reading about this.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Today in LightGBM, when send, receive, or accept operations fail in socket-based distributed training, a FATAL-level log message is printed with the integer code for that error.
For example, you might get a message like those that have been showing up in the Dask tests (#4074, #4116).
This PR proposes changing those log messages to include the corresponding error text. Using the reproducible example from #4074 (comment), for example, you can see that after this change the error message I mentioned would become:
How this improves lightgbm
If training fails with such an error, users who are not experienced C/C++ programmers will likely not know what a code like 104 actually means. For example, it took me some significant effort to figure out what that error meant while reviewing @ffineis's proposal for adding early stopping in lightgbm.dask: #3952 (review).
I think that including the corresponding error text would help such users in debugging. This is more relevant now than it was in the recent past, since lightgbm.dask makes distributed LightGBM training accessible to people who are most comfortable working in Python.
Notes for reviewers
@-ing @imatiach-msft for visibility since I'm guessing this change would also impact MMLSpark.