Catch BaseException in host discovery thread to prevent silently dying #3436

ASDen · 2022-03-01T14:40:58Z

When any error other than a RuntimeError occurs, the host discovery thread silently dies, and the training process is unaware of any changes in the list of hosts.

Fix this by catching any BaseException instead.

maxhgerlach

HI @ASDen,

thanks for you contribution! Note that you would need to sign the Developer Certificate of Origin by signing off your commit.

I'm not very familiar with the elastic code base (@tgaddair might know more about the error handling here), but BaseException feels awfully broad.

maxhgerlach · 2022-03-02T17:34:56Z

horovod/runner/elastic/driver.py

-            except RuntimeError as e:
+            except BaseException as e:


Wouldn't it be better to catch Exception here, rather than BaseException?

BaseException would also include SystemExit and KeyboardInterrupt and prevent the Python interpreter from shutting down. (Going by https://docs.python.org/3/library/exceptions.html#exception-hierarchy)

I think RuntimeError here is ok, as it represents a retry-able exception. I agree that other exceptions should not kill the discovery thread. Instead of broadening the exception here, I would add a broader exception that always calls self._shutdown.set() to gracefully shutdown the Horovod job.

Then, the host manager and _notify_workers_host_changes should only raise RuntimeError for transient errors or some other exception for fatal errors.

I'd be more interested in the particular error that killed the discovery thread on your side (it must be in the logs).

Thanks alot for having a look !

@maxhgerlach This shouldn't prevent the interpreter from shutting down, since this thread is marked as daemon here. However, I think Exception should also work.

@EnricoMi I think that in our case here, given that we have the host discovery thread repeatedly executing an arbitrary (possibly user defined) host discovery function that can have any random glitch for a few executions cycles then returns to normal, it makes a lot of sense to just log the error (whatever happened) to user, and continue normal

I would disagree with shutting down the whole training session, for an error (whatever its source) in the host discovery function.

Also, RuntimeError is far from being the only retry-able exception, maybe Exception mentioned by @maxhgerlach. But there are many retry-able exceptions that are not RuntimeError (e.g. ZeroDivisionError, FileNotFoundError, TimeoutError ,...etc)

I was personally hit by this in the form of a ImportError (inside a custom discovery class) that silently killed the host discovery thread (and it was awful to debug :(

sounds as if Exception vs BaseException wouldn't make much of a difference here, so no objections from me

ASDen · 2022-03-14T08:03:30Z

Hi @EnricoMi @maxhgerlach any updates on this ?
I have another bug fix awaiting review here #3437, I have also another couple Elastic bug fixes that I would like to submit if maintainers are keen to have a look

Thanks,

EnricoMi · 2022-03-16T22:05:55Z

@ASDen please see the DCO issue and follow these instructions: https://github.com/horovod/horovod/pull/3436/checks?check_run_id=5376922918

github-actions · 2022-03-17T04:15:36Z

Unit Test Results

    802 files -   28     802 suites - 28 9h 48m 41s ⏱️ - 10m 10s
    757 tests ±    0     714 ✔️ ±    0     43 💤 ±    0 0 ❌ ±0
18 684 runs - 732 13 401 ✔️ - 468 5 283 💤 - 264 0 ❌ ±0

Results for commit c9ee58a. ± Comparison against base commit 980ce05.

♻️ This comment has been updated with latest results.

github-actions · 2022-03-17T04:15:51Z

Unit Test Results (with flaky tests)

    932 files +  12     932 suites +12 10h 31m 10s ⏱️ + 6m 55s
    757 tests ±    0     712 ✔️ -     1     43 💤 ±  0 1 ❌ ±0 1 🔥 +1
21 822 runs +270 15 395 ✔️ +240 6 425 💤 +29 1 ❌ ±0 1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit c9ee58a. ± Comparison against base commit 980ce05.

♻️ This comment has been updated with latest results.

Signed-off-by: Mohamed Yousef <myb@imachines.com>

ASDen · 2022-03-19T18:44:54Z

@EnricoMi Done!

maxhgerlach · 2022-03-30T08:15:03Z

I finally clicked that merge button (assuming that that test failure was caused by infrastructure flakiness). Sorry for the delay, @ASDen.

maxhgerlach reviewed Mar 2, 2022

View reviewed changes

EnricoMi approved these changes Mar 16, 2022

View reviewed changes

EnricoMi changed the title ~~Catch any type of exception to prevent the host discovery thread from silently dying~~ Catch any kind of BaseException in host discovery thread to prevent silently dying Mar 16, 2022

EnricoMi changed the title ~~Catch any kind of BaseException in host discovery thread to prevent silently dying~~ Catch BaseException in host discovery thread to prevent silently dying Mar 16, 2022

ASDen force-pushed the patch-1 branch from 7730507 to e39ab13 Compare March 19, 2022 17:27

Catch any exception to prevent the discovery thread from silently dying

c9ee58a

Signed-off-by: Mohamed Yousef <myb@imachines.com>

ASDen force-pushed the patch-1 branch from e39ab13 to c9ee58a Compare March 19, 2022 18:43

maxhgerlach merged commit afb0497 into horovod:master Mar 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catch BaseException in host discovery thread to prevent silently dying #3436

Catch BaseException in host discovery thread to prevent silently dying #3436

ASDen commented Mar 1, 2022 •

edited by EnricoMi

Loading

maxhgerlach left a comment

maxhgerlach Mar 2, 2022

EnricoMi Mar 2, 2022

ASDen Mar 3, 2022

EnricoMi Mar 16, 2022

maxhgerlach Mar 17, 2022

ASDen commented Mar 14, 2022

EnricoMi commented Mar 16, 2022

github-actions bot commented Mar 17, 2022 •

edited

Loading

github-actions bot commented Mar 17, 2022 •

edited

Loading

ASDen commented Mar 19, 2022

maxhgerlach commented Mar 30, 2022

Catch BaseException in host discovery thread to prevent silently dying #3436

Catch BaseException in host discovery thread to prevent silently dying #3436

Conversation

ASDen commented Mar 1, 2022 • edited by EnricoMi Loading

maxhgerlach left a comment

Choose a reason for hiding this comment

maxhgerlach Mar 2, 2022

Choose a reason for hiding this comment

EnricoMi Mar 2, 2022

Choose a reason for hiding this comment

ASDen Mar 3, 2022

Choose a reason for hiding this comment

EnricoMi Mar 16, 2022

Choose a reason for hiding this comment

maxhgerlach Mar 17, 2022

Choose a reason for hiding this comment

ASDen commented Mar 14, 2022

EnricoMi commented Mar 16, 2022

github-actions bot commented Mar 17, 2022 • edited Loading

Unit Test Results

github-actions bot commented Mar 17, 2022 • edited Loading

Unit Test Results (with flaky tests)

ASDen commented Mar 19, 2022

maxhgerlach commented Mar 30, 2022

ASDen commented Mar 1, 2022 •

edited by EnricoMi

Loading

github-actions bot commented Mar 17, 2022 •

edited

Loading

github-actions bot commented Mar 17, 2022 •

edited

Loading