Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

show specific error message in TCP accept/send/receive logs #4128

Merged
merged 1 commit into from
Apr 29, 2021

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Mar 28, 2021

Today in LightGBM, when send, receive, or accept operations fail in socket-based distributed training, a FATAL-level log message is printed with the integer code for that error.

For example, you might get a message like those that have been showing up in the Dask tests (#4074, #4116)

lightgbm.basic.LightGBMError: Socket recv error, code: 104

This PR proposes changing those log messages to include the corresponding error text. Using the reproducible example from #4074 (comment), for example, you can see that after this change the error message I mentioned would become:

LightGBMError: Socket recv error, Connection reset by peer (code: 104)

image

How this improves lightgbm

If training fails with such an error, users who are not experienced C/C++ programmers will likely not know what a code like 104 actually means. For example, it took me some significant effort to figure out what that error meant while reviewing @ffineis 's proposal for adding early stopping in lightgbm.dask: #3952 (review).

I think that including the corresponding error text would help such users in debugging. This is more relevant now than it was in the recent past, since lightgbm.dask makes distributed LightGBM training accessible to people who are most comfortable working in Python.

Notes for reviewers

  • I'm only proposing this change for non-Windows systems, since I don't have easy access to a way to test distributed training on Windows, and since the equivalent process of converting error codes to error messages on Windows seems a bit more involved: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage. If reviewers agree with this PR, I'll write up a feature request for adding similar messages for distributed training on Windows.
  • I can't think of a reliable way to simulate the types of connectivity problems that would trigger these messages, so I couldn't come up with automated tests for this change. Maybe such tests could be added in Write tests for parallel code #3841 .
  • @-ing @imatiach-msft for visibility since I'm guessing this change would also impact MMLSpark.

#if defined(_WIN32)
Log::Fatal("Socket accept error (code: %d)", err_code);
#else
Log::Fatal("Socket accept error, %s (code: %d)", std::strerror(err_code), err_code);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we switch to a thread-safe version of strerror, (strerror_r for non-windows and strerror_s for windows)?

Copy link
Collaborator Author

@jameslamb jameslamb Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shiyu1994 sorry for the delay, I'm back to this.

Will that cause portability problems?

I saw in https://en.cppreference.com/w/c/string/byte/strerror that those functions will only work when __STDC_LIB_EXT1__ is defined.

As with all bounds-checked functions, strerror_s and strerrorlen_s are only guaranteed to be available if STDC_LIB_EXT1 is defined by the implementation and if the user defines STDC_WANT_LIB_EXT1 to the integer constant 1 before including string.h.

And when I researched a little bit about support for that, I found some posts that suggest that those aren't likely to be supported by default in many setups.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So maybe keep the current change is the best option.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks very much for bringing it up! I learned new things reading about this.

@shiyu1994 shiyu1994 merged commit f97aa86 into master Apr 29, 2021
@jameslamb jameslamb deleted the network-logging branch April 29, 2021 02:22
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants