add failure injector for monarch training script #270
LGTM! Looks like some of the tests/lint is failing?
            f"{self.uid} Injecting failure ({failure_type}) into random trainer"
        )

        await self.failure_actors.fail.choose(failure_type)
.choose picks an arbitrary trainer?
> .choose picks an arbitrary trainer?
yup, choose will send it to one random trainer in the replica mesh
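To make the reply above concrete, here is a plain-Python analogy of "send to one random trainer" versus "send to all". It is not the Monarch API: `ToyMesh`, its `choose` and `broadcast` methods, and the list-based inboxes are hypothetical names invented for illustration only.

```python
import random


class ToyMesh:
    """Hypothetical stand-in for a mesh of trainers; each trainer is a list inbox."""

    def __init__(self, inboxes, rng=None):
        self.inboxes = list(inboxes)
        self.rng = rng or random.Random()

    def choose(self, message):
        """Deliver the message to exactly one inbox picked at random."""
        index = self.rng.randrange(len(self.inboxes))
        self.inboxes[index].append(message)
        return index

    def broadcast(self, message):
        """Deliver the message to every inbox (shown only for contrast)."""
        for inbox in self.inboxes:
            inbox.append(message)


if __name__ == "__main__":
    inboxes = [[] for _ in range(4)]
    mesh = ToyMesh(inboxes)
    hit = mesh.choose("SEGFAULT")
    print(f"trainer {hit} received the failure; others received nothing")
```

Under this reading, each injection affects a single trainer, which matches the "into random trainer" wording in the log line.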
hm failing test is unrelated/flaky, I see it on #268
LGTM! thanks
@amirafzali has imported this pull request. If you are a Meta employee, you can view this in D83601242.
Summary:
This introduces a failure injector with 5 failure modes:
- SEGFAULT: Triggers a SIGSEGV on the process
- DEADLOCK: Deadlocks the GIL, resulting in ProcessGroupNCCL timeout and terminal failure
- KILL_PROC: Immediately kills the process with non-zero exit code
- COMMS: Forcefully aborts the ProcessGroup and NCCL communicator
- KILL_SLURM: Kills a random replica SLURM job

It can be enabled with the flag `--with--failure`, and it runs async every 120 seconds.

Reviewed By: tushar00jain
Differential Revision: D83601242
Pulled By: amirafzali
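For readers who want a feel for the shape of such an injector, the following is a minimal sketch of a periodic async loop that picks one of the five modes at random. It is not the code from this PR: `FailureMode`, `inject_periodically`, and the `inject` callback are hypothetical names, the 120-second default is taken from the summary above, and the callback here only records the chosen mode rather than triggering a real failure.

```python
import asyncio
import enum
import random


class FailureMode(enum.Enum):
    """The five modes named in the summary (enum name is an assumption)."""
    SEGFAULT = "segfault"
    DEADLOCK = "deadlock"
    KILL_PROC = "kill_proc"
    COMMS = "comms"
    KILL_SLURM = "kill_slurm"


async def inject_periodically(inject, interval_s=120.0, rounds=None, rng=None):
    """Every interval_s seconds, pick a random mode and await inject(mode).

    rounds=None loops forever; a finite value is handy for dry runs.
    Returns the number of injections performed.
    """
    rng = rng or random.Random()
    done = 0
    while rounds is None or done < rounds:
        await asyncio.sleep(interval_s)
        mode = rng.choice(list(FailureMode))
        await inject(mode)  # placeholder: a real injector would act here
        done += 1
    return done


async def _demo():
    async def log_only(mode):
        print(f"would inject {mode.name}")

    # Short interval and bounded rounds so the demo finishes quickly.
    await inject_periodically(log_only, interval_s=0.01, rounds=3)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Passing the action in as a callback keeps the timing loop testable without killing any process; the actual delivery to a trainer would sit behind that callback.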
Force-pushed from 911abd4 to fce2791.
@amirafzali has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83601242.
@amirafzali merged this pull request in d596ec7.
This introduces a failure injector with 5 failure modes: SEGFAULT, DEADLOCK, KILL_PROC, COMMS, and KILL_SLURM.
It can be enabled with the flag `--with--failure`, and it runs async every 60 seconds.