Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch BaseException in host discovery thread to prevent silently dying #3436

Merged
merged 1 commit into from Mar 30, 2022

Conversation

ASDen
Copy link
Contributor

@ASDen ASDen commented Mar 1, 2022

When any error other than a RuntimeError occurs, the host discovery thread silently dies, and the training process is unaware of any changes in the list of hosts.

Fix this by catching any BaseException instead.

Copy link
Collaborator

@maxhgerlach maxhgerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @ASDen,

thanks for you contribution! Note that you would need to sign the Developer Certificate of Origin by signing off your commit.

I'm not very familiar with the elastic code base (@tgaddair might know more about the error handling here), but BaseException feels awfully broad.

except RuntimeError as e:
except BaseException as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to catch Exception here, rather than BaseException?

BaseException would also include SystemExit and KeyboardInterrupt and prevent the Python interpreter from shutting down. (Going by https://docs.python.org/3/library/exceptions.html#exception-hierarchy)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think RuntimeError here is ok, as it represents a retry-able exception. I agree that other exceptions should not kill the discovery thread. Instead of broadening the exception here, I would add a broader exception that always calls self._shutdown.set() to gracefully shutdown the Horovod job.

Then, the host manager and _notify_workers_host_changes should only raise RuntimeError for transient errors or some other exception for fatal errors.

I'd be more interested in the particular error that killed the discovery thread on your side (it must be in the logs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks alot for having a look !

@maxhgerlach This shouldn't prevent the interpreter from shutting down, since this thread is marked as daemon here. However, I think Exception should also work.

@EnricoMi I think that in our case here, given that we have the host discovery thread repeatedly executing an arbitrary (possibly user defined) host discovery function that can have any random glitch for a few executions cycles then returns to normal, it makes a lot of sense to just log the error (whatever happened) to user, and continue normal

I would disagree with shutting down the whole training session, for an error (whatever its source) in the host discovery function.

Also, RuntimeError is far from being the only retry-able exception, maybe Exception mentioned by @maxhgerlach. But there are many retry-able exceptions that are not RuntimeError (e.g. ZeroDivisionError, FileNotFoundError, TimeoutError ,...etc)

I was personally hit by this in the form of a ImportError (inside a custom discovery class) that silently killed the host discovery thread (and it was awful to debug :(

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds as if Exception vs BaseException wouldn't make much of a difference here, so no objections from me

@ASDen
Copy link
Contributor Author

ASDen commented Mar 14, 2022

Hi @EnricoMi @maxhgerlach any updates on this ?
I have another bug fix awaiting review here #3437, I have also another couple Elastic bug fixes that I would like to submit if maintainers are keen to have a look

Thanks,

@EnricoMi
Copy link
Collaborator

@ASDen please see the DCO issue and follow these instructions: https://github.com/horovod/horovod/pull/3436/checks?check_run_id=5376922918

@EnricoMi EnricoMi changed the title Catch any type of exception to prevent the host discovery thread from silently dying Catch any kind of BaseException in host discovery thread to prevent silently dying Mar 16, 2022
@EnricoMi EnricoMi changed the title Catch any kind of BaseException in host discovery thread to prevent silently dying Catch BaseException in host discovery thread to prevent silently dying Mar 16, 2022
@github-actions
Copy link

github-actions bot commented Mar 17, 2022

Unit Test Results

     802 files   -   28       802 suites   - 28   9h 48m 41s ⏱️ - 10m 10s
     757 tests ±    0       714 ✔️ ±    0       43 💤 ±    0  0 ±0 
18 684 runs   - 732  13 401 ✔️  - 468  5 283 💤  - 264  0 ±0 

Results for commit c9ee58a. ± Comparison against base commit 980ce05.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Mar 17, 2022

Unit Test Results (with flaky tests)

     932 files  +  12       932 suites  +12   10h 31m 10s ⏱️ + 6m 55s
     757 tests ±    0       712 ✔️  -     1       43 💤 ±  0  1 ±0  1 🔥 +1 
21 822 runs  +270  15 395 ✔️ +240  6 425 💤 +29  1 ±0  1 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit c9ee58a. ± Comparison against base commit 980ce05.

♻️ This comment has been updated with latest results.

Signed-off-by: Mohamed Yousef <myb@imachines.com>
@ASDen
Copy link
Contributor Author

ASDen commented Mar 19, 2022

@EnricoMi Done!

@maxhgerlach maxhgerlach merged commit afb0497 into horovod:master Mar 30, 2022
@maxhgerlach
Copy link
Collaborator

I finally clicked that merge button (assuming that that test failure was caused by infrastructure flakiness). Sorry for the delay, @ASDen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants