[monarch] Add message envelope header for no return #1777

samlurye · 2025-11-07T21:27:14Z

Stack from ghstack (oldest at bottom):

-> [monarch] Add message envelope header for no return #1777

Sometimes when we send a message, we want it to be fully fire-and-forget,
even if the destination is unreachable. This is typically only used in
scenarios like:

* When shutting down the system, we try to ask a process to shut itself down gracefully
  before forcibly killing it. If the message is undeliverable, we can just proceed with
  killing the process (it's probably already dead anyway).
* Replying to a message. If the sender is down, there's nothing the current actor can do about it.

This should be used sparingly, as it can hide real errors, such as messages not getting sent.

Add a header to MessageEnvelope for this use case, which avoids posting the Undelivered
message back to the sender.
Use it with a send of the StopAll message.

Differential Revision: D86315780

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!
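
The header's effect can be illustrated with a small toy model. This is a sketch only, not the hyperactor API: the `Envelope` and `Router` classes, the actor names, and the result strings are all hypothetical, and the flag is named `return_undeliverable` after the property mentioned in the later revision of this PR. It assumes the natural reading of that name: `True` (the default) returns an undeliverable message to its sender, `False` drops it silently.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Toy stand-in for a message envelope (hypothetical, not MessageEnvelope)."""
    sender: str
    dest: str
    payload: str
    # Assumed semantics: True returns the message to the sender on failure.
    return_undeliverable: bool = True


@dataclass
class Router:
    """Maps actor names to inboxes; an unregistered name counts as unreachable."""
    inboxes: dict = field(default_factory=dict)

    def register(self, name: str) -> None:
        self.inboxes[name] = []

    def post(self, env: Envelope) -> str:
        inbox = self.inboxes.get(env.dest)
        if inbox is not None:
            inbox.append(env.payload)
            return "delivered"
        if not env.return_undeliverable:
            # Fire-and-forget: the failure is swallowed on purpose.
            return "dropped"
        # Default path: tell the sender that delivery failed.
        sender_inbox = self.inboxes.get(env.sender)
        if sender_inbox is not None:
            sender_inbox.append(("Undelivered", env.payload))
        return "returned"


router = Router()
router.register("host_agent")  # "proc_agent" is never registered (already dead)

default_result = router.post(Envelope("host_agent", "proc_agent", "StopAll"))
quiet_result = router.post(
    Envelope("host_agent", "proc_agent", "StopAll", return_undeliverable=False)
)
```

In this model only the first send leaves an `Undelivered` entry in the sender's inbox. The second send fails just as surely, but nobody is told, which is why the flag should be used sparingly.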

…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

Sometimes when we send a message, we want it to be fully fire-and-forget, even if the destination is unreachable. This is typically only used in scenarios like:

* When shutting down the system, we try to ask a process to shut itself down gracefully before forcibly killing it. If the message is undeliverable, we can just proceed with killing the process (it's probably already dead anyway).
* Replying to a message. If the sender is down, there's nothing the current actor can do about it.

This should be used sparingly, as it can hide real errors, such as messages not getting sent.

This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on a `PortRef`, any `MessageEnvelope` sent via that `PortRef` carries the same value for `return_undeliverable`. Any envelope with `return_undeliverable == false` will not be returned to its sender on delivery failure. This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail if its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent.

Unrelatedly, this diff also fixes a race condition between host/proc mesh shutdown and tensor engine shutdown. Previously, `DeviceMesh.exit` sent a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. If the host/proc mesh was shut down at the same time, the worker exit message could fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh, so there can't be a race with host/proc mesh shutdown (at least not from the same thread).

ghstack-source-id: 322252639
@exported-using-ghexport

Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
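
The propagation from `PortRef` to `MessageEnvelope` described in this commit message might look roughly like the following toy model. Everything here is a hypothetical sketch rather than the real hyperactor types: the class bodies, `make_envelope`, `deliver`, and the port name `client.reply` are made up for illustration. It assumes that a `return_undeliverable` value of `False` means the envelope is dropped rather than returned when delivery fails.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    """Toy envelope (hypothetical); carries the flag it was created with."""
    dest: str
    payload: object
    return_undeliverable: bool = True


@dataclass
class PortRef:
    """Toy port reference (hypothetical); stamps its flag on every envelope."""
    dest: str
    return_undeliverable: bool = True

    def make_envelope(self, payload: object) -> Envelope:
        # The envelope inherits the port's setting.
        return Envelope(self.dest, payload, self.return_undeliverable)


def deliver(env: Envelope, live_ports: dict, returned: list) -> bool:
    """Deliver if the port is live; otherwise return to sender only if asked."""
    if env.dest in live_ports:
        live_ports[env.dest].append(env.payload)
        return True
    if env.return_undeliverable:
        returned.append(env)
    return False


# A reply port whose owner has already gone away: no live port is registered.
reply_port = PortRef("client.reply", return_undeliverable=False)
live_ports = {}
returned = []

reply = reply_port.make_envelope({"state": "running"})
delivered = deliver(reply, live_ports, returned)
```

Here the reply to a `GetState`-style request is lost, but `returned` stays empty, so the replying actor never has to handle an undeliverable message for a requester it can do nothing about.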

meta-codesync · 2025-11-12T11:19:24Z

This pull request has been merged in 6976258.

meta-cla bot added the CLA Signed label Nov 7, 2025

meta-codesync bot added fb-exported meta-exported labels Nov 7, 2025

meta-codesync bot closed this in 6976258 Nov 12, 2025

facebook-github-bot added the Merged label Nov 12, 2025
