-
Notifications
You must be signed in to change notification settings - Fork 109
[monarch] Add message envelope header for no return #1777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 7, 2025
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! ghstack-source-id: 321786816 Pull Request resolved: #1777
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322252639 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322253959 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322255551 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322489939 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322512928 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered message back to the sender. Use it with a send of the StopAll message. Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)! [ghstack-poisoned]
samlurye
added a commit
that referenced
this pull request
Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown Pull Request resolved: #1777 Sometimes when we send a message, we want it to be fully fire-and-forget, including if the destination is not even reachable. This is typically only used in scenarios like: * When shutting down the system, we try to ask a process to nicely shut itself down before ungracefully killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyways) * Replying to a message. If the sender is down, there's nothing the current actor can do about it This should be used sparingly as it could hide real errors, like your messages not getting sent. This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on `PortRef`, any `MessageEnvelope` sent via that `PortRef` will have an equivalent value for `return_undeliverable`. Any envelope with `return_undeliverable == true` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent. Unrelatedly, this diff also fixes a race condition with host/proc mesh shutdown vs. tensor engine shutdown. Basically, `DeviceMesh.exit` was sending a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. But if you simultaneously try to shut down the host/proc mesh, then the worker exit message might fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh so that there can't be a race with host/proc mesh shutdown (at least not from the same thread). ghstack-source-id: 322545046 @exported-using-ghexport Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
|
This pull request has been merged in 6976258. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Stack from ghstack (oldest at bottom):
Sometimes when we send a message, we want it to be fully fire-and-forget,
including if the destination is not even reachable. This is typically only used in
scenarios like:
before ungracefully killing it. If the message is undeliverable, we can just proceed with
killing the process (it's probably already dead anyways)
This should be used sparingly as it could hide real errors, like your messages not getting sent.
Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered
message back to the sender.
Use it with a send of the StopAll message.
Differential Revision: D86315780
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!