Catch BaseException in host discovery thread to prevent silently dying #3436
Hi @ASDen,
thanks for your contribution! Note that you would need to sign the Developer Certificate of Origin by signing off your commit.
I'm not very familiar with the elastic code base (@tgaddair might know more about the error handling here), but `BaseException` feels awfully broad.
- except RuntimeError as e:
+ except BaseException as e:
Wouldn't it be better to catch `Exception` here, rather than `BaseException`?
`BaseException` would also include `SystemExit` and `KeyboardInterrupt` and prevent the Python interpreter from shutting down. (Going by https://docs.python.org/3/library/exceptions.html#exception-hierarchy)
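The difference between the two handler types can be checked directly against the built-in exception hierarchy. The snippet below is only an illustration of standard Python behaviour, not Horovod code; the helper `caught_by` is a hypothetical name introduced for this example.

```python
def caught_by(handler_type, exc):
    """Return True if `except handler_type` would catch `exc`."""
    try:
        raise exc
    except handler_type:
        return True
    except BaseException:
        # Anything that slipped past the narrower handler ends up here.
        return False


samples = [
    ("RuntimeError", RuntimeError("transient")),
    ("ImportError", ImportError("missing module")),
    ("KeyboardInterrupt", KeyboardInterrupt()),
    ("SystemExit", SystemExit(1)),
]

# Map each exception name to (caught by Exception, caught by BaseException).
results = {
    name: (caught_by(Exception, exc), caught_by(BaseException, exc))
    for name, exc in samples
}

for name, (by_exception, by_base) in results.items():
    print(f"{name}: Exception={by_exception} BaseException={by_base}")
```

`RuntimeError` and `ImportError` are caught by both handlers, while `KeyboardInterrupt` and `SystemExit` are caught only by `except BaseException`.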
I think `RuntimeError` here is ok, as it represents a retry-able exception. I agree that other exceptions should not kill the discovery thread. Instead of broadening the exception here, I would add a broader exception handler that always calls `self._shutdown.set()` to gracefully shut down the Horovod job.
Then, the host manager and `_notify_workers_host_changes` should only raise `RuntimeError` for transient errors, or some other exception for fatal errors.
I'd be more interested in the particular error that killed the discovery thread on your side (it must be in the logs).
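The proposal above could look roughly like the following. This is a minimal sketch, not the Horovod implementation: `discovery_loop`, `update_hosts`, `max_cycles` and `interval` are hypothetical names, and a plain `threading.Event` stands in for `self._shutdown`.

```python
import logging
import threading


def discovery_loop(update_hosts, shutdown, max_cycles=None, interval=0.0):
    """Retry on RuntimeError; request a shutdown on any other exception."""
    cycles = 0
    while not shutdown.is_set():
        if max_cycles is not None and cycles >= max_cycles:
            break
        cycles += 1
        try:
            update_hosts()
        except RuntimeError as e:
            # Treated as transient: log it and try again on the next cycle.
            logging.warning("transient discovery error, retrying: %s", e)
        except Exception as e:
            # Treated as fatal: signal the rest of the job to shut down.
            logging.error("fatal discovery error, shutting down: %s", e)
            shutdown.set()
        shutdown.wait(interval)
    return cycles


calls = []


def flaky_update():
    """Hypothetical update: transient error on call 1, fatal error on call 3."""
    calls.append(1)
    if len(calls) == 1:
        raise RuntimeError("temporary glitch")
    if len(calls) == 3:
        raise ValueError("unrecoverable configuration error")


shutdown = threading.Event()
cycles = discovery_loop(flaky_update, shutdown, max_cycles=10)
print(cycles, shutdown.is_set())
```

With this split, the loop survives the `RuntimeError`, and the first non-retry-able error stops it by setting the event rather than by letting the thread die unnoticed.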
Thanks a lot for having a look!
@maxhgerlach This shouldn't prevent the interpreter from shutting down, since this thread is marked as daemon here. However, I think `Exception` should also work.
@EnricoMi I think that in our case, the host discovery thread repeatedly executes an arbitrary (possibly user-defined) host discovery function that can glitch for a few execution cycles and then return to normal, so it makes a lot of sense to just log the error (whatever happened) for the user and continue as normal.
I would disagree with shutting down the whole training session for an error (whatever its source) in the host discovery function.
Also, `RuntimeError` is far from being the only retry-able exception; `Exception`, as mentioned by @maxhgerlach, may be a better fit. There are many retry-able exceptions that are not `RuntimeError` (e.g. `ZeroDivisionError`, `FileNotFoundError`, `TimeoutError`, etc.).
I was personally hit by this in the form of an `ImportError` (inside a custom discovery class) that silently killed the host discovery thread (and it was awful to debug :( ).
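The log-and-continue behaviour argued for above can be sketched as follows. All names here (`make_glitchy_discovery`, `run_discovery`, the host names and slot counts) are made up for illustration and are not part of Horovod.

```python
import logging
import threading


def make_glitchy_discovery():
    """Build a hypothetical discovery function that fails on calls 2 and 3."""
    state = {"calls": 0}

    def discover():
        state["calls"] += 1
        if state["calls"] == 2:
            raise ImportError("No module named 'my_discovery_helper'")
        if state["calls"] == 3:
            raise TimeoutError("discovery script timed out")
        return {"host-a": 4, "host-b": 4}

    return discover


def run_discovery(discover, cycles, seen):
    """Run a fixed number of discovery cycles, logging any failure."""
    for _ in range(cycles):
        try:
            seen.append(discover())
        except Exception:
            # Log the full traceback and carry on with the next cycle.
            logging.exception("host discovery failed; keeping previous host list")


seen = []
thread = threading.Thread(
    target=run_discovery,
    args=(make_glitchy_discovery(), 5, seen),
    daemon=True,
)
thread.start()
thread.join()
print(len(seen))
```

Three of the five cycles return a host list; the two failing cycles are logged with their tracebacks and the loop keeps going.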
Agreed.
Sounds as if `Exception` vs `BaseException` wouldn't make much of a difference here, so no objections from me.
Hi @EnricoMi @maxhgerlach, any updates on this? Thanks,
@ASDen please see the DCO issue and follow these instructions: https://github.com/horovod/horovod/pull/3436/checks?check_run_id=5376922918
Unit Test Results (with flaky tests): 932 files (+12), 932 suites (+12), 10h 31m 10s (+6m 55s). For more details on these failures and errors, see this check. Results for commit c9ee58a, compared against base commit 980ce05. This comment has been updated with latest results.
Signed-off-by: Mohamed Yousef <myb@imachines.com>
@EnricoMi Done!
I finally clicked that merge button (assuming that that test failure was caused by infrastructure flakiness). Sorry for the delay, @ASDen.
When any error other than a `RuntimeError` occurs, the host discovery thread silently dies, and the training process is unaware of any changes in the list of hosts.
Fix this by catching any `BaseException` instead.
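The failure mode described here can be reproduced with plain `threading`, independent of Horovod. In this sketch, `narrow_loop` and `broken_discovery` are hypothetical stand-ins for the discovery thread and a user-defined discovery function; `threading.excepthook` (Python 3.8+) is used only to record the uncaught exception.

```python
import threading


def narrow_loop(discover, stop, progress):
    """Loop that handles only RuntimeError, like the code before this change."""
    while not stop.is_set():
        try:
            discover()
            progress.append("ok")
        except RuntimeError:
            progress.append("retried")
        stop.wait(0.01)


def broken_discovery():
    """Hypothetical discovery function failing with a non-RuntimeError."""
    raise ImportError("No module named 'custom_discovery'")


# Record uncaught thread exceptions instead of printing them to stderr.
uncaught = []
threading.excepthook = lambda args: uncaught.append(args.exc_type)

stop = threading.Event()
progress = []
thread = threading.Thread(
    target=narrow_loop, args=(broken_discovery, stop, progress), daemon=True
)
thread.start()
thread.join(timeout=5)

# The main program keeps running, but the discovery thread is gone.
print(thread.is_alive(), progress, uncaught)
```

The `ImportError` escapes the narrow handler on the first cycle, so the thread exits while the rest of the program carries on without any further host updates.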