@samlurye samlurye commented Nov 7, 2025

Stack from ghstack (oldest at bottom):

Sometimes when we send a message, we want it to be fully fire-and-forget,
even if the destination is not reachable at all. This is typically only needed in
scenarios like:

  • When shutting down the system, we ask a process to shut itself down gracefully
    before forcibly killing it. If the message is undeliverable, we can just proceed with
    killing the process (it's probably already dead anyway).
  • Replying to a message. If the original sender is down, there's nothing the current actor can do about it.

This should be used sparingly, as it can hide real errors, such as your messages never being delivered.

Add a header to MessageEnvelope for this use case, which suppresses posting the Undelivered
message back to the sender.
Use it when sending the StopAll message.

Differential Revision: D86315780

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

samlurye added a commit that referenced this pull request Nov 7, 2025
ghstack-source-id: 321786816
Pull Request resolved: #1777
@meta-cla meta-cla bot added the CLA Signed label Nov 7, 2025
samlurye added a commit that referenced this pull request Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

Sometimes when we send a message, we want it to be fully fire-and-forget,
including if the destination is not even reachable. This is typically only used in
scenarios like:
* When shutting down the system, we try to ask a process to nicely shut itself down
before ungracefully killing it. If the message is undeliverable, we can just proceed with
killing the process (it's probably already dead anyways)
* Replying to a message. If the sender is down, there's nothing the current actor can do about it

This should be used sparingly as it could hide real errors, like your messages not getting sent.

This diff adds a `return_undeliverable: bool` property on `MessageEnvelope` and `PortRef`. When the property is set on a `PortRef`, any `MessageEnvelope` sent via that `PortRef` carries the same value for `return_undeliverable`. Any envelope with `return_undeliverable == false` will not be returned to its sender on delivery failure.

This is useful for messages like `GetRankStatus` and `GetState`, where the receiver shouldn't fail if its reply fails to be delivered. It is also useful during proc termination, when the host mesh agent sends `StopAll` to the proc mesh agent; if the proc mesh agent is already dead, the message won't be delivered, but that shouldn't crash the host mesh agent.
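
To make the flag's behavior concrete, here is a minimal toy model in Python. It is not the real implementation: `Envelope`, `ToyPortRef`, and `ToyRouter` are hypothetical stand-ins, and the sketch assumes the flag reads the way its name suggests (an envelope is handed back to its sender on delivery failure only when `return_undeliverable` is true, which is also assumed to be the default).

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical stand-in for a message envelope."""
    sender: str
    dest: str
    payload: str
    return_undeliverable: bool = True  # assumed default: report failures


@dataclass
class ToyPortRef:
    """Hypothetical port reference that stamps its flag onto every envelope."""
    dest: str
    return_undeliverable: bool = True

    def envelope(self, sender: str, payload: str) -> Envelope:
        # The envelope inherits the flag from the port it is sent through.
        return Envelope(sender, self.dest, payload, self.return_undeliverable)


@dataclass
class ToyRouter:
    """Delivers to live actors; on failure, consults the envelope's flag."""
    live: set = field(default_factory=set)
    delivered: list = field(default_factory=list)
    returned: list = field(default_factory=list)

    def post(self, env: Envelope) -> None:
        if env.dest in self.live:
            self.delivered.append(env)
        elif env.return_undeliverable:
            # Default path: hand the envelope back to its sender.
            self.returned.append(env)
        # Otherwise: fire-and-forget, the failure is dropped silently.


router = ToyRouter(live={"host_agent"})  # the proc agent is already dead

# Default port: the failed StopAll comes back to the host agent.
router.post(ToyPortRef("proc_agent").envelope("host_agent", "StopAll"))

# Opted-out port: the failed StopAll is dropped without notifying anyone.
quiet = ToyPortRef("proc_agent", return_undeliverable=False)
router.post(quiet.envelope("host_agent", "StopAll"))
```

After both sends, only the first envelope ends up in `router.returned`; the second leaves no trace, which is exactly why the option should be used sparingly.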

Unrelatedly, this diff also fixes a race between host/proc mesh shutdown and tensor engine shutdown. Previously, `DeviceMesh.exit` sent a fire-and-forget `WorkerMessage::Exit` via `Controller.drain_and_stop()`. If the host/proc mesh was shut down at the same time, the worker exit message could fail to deliver, crashing the process. With this diff, `Controller.drain_and_stop()` synchronously calls `ActorMesh::stop` on the worker actor mesh, so there can't be a race with host/proc mesh shutdown (at least not from the same thread).
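
The ordering problem can be illustrated with a small deterministic toy model. Everything here is hypothetical (`ToyMesh` and its methods are not part of the real codebase); the sketch only shows why a queued fire-and-forget exit can become undeliverable when mesh shutdown overtakes it, and why a stop that completes before returning cannot.

```python
class ToyMesh:
    """Hypothetical mesh whose inbox is only drained when pump() runs."""

    def __init__(self):
        self.worker_alive = True
        self.inbox = []
        self.undeliverable = []

    def send_exit(self):
        # Fire-and-forget: the message is queued but not yet handled.
        self.inbox.append("Exit")

    def pump(self):
        # Deliver queued messages; a dead worker makes them undeliverable.
        while self.inbox:
            msg = self.inbox.pop(0)
            if self.worker_alive:
                self.worker_alive = False  # worker handles Exit and stops
            else:
                self.undeliverable.append(msg)

    def stop_worker_sync(self):
        # Synchronous stop: returns only after the worker has handled Exit.
        self.send_exit()
        self.pump()

    def shutdown(self):
        # Host/proc mesh teardown kills the worker, then drains the inbox.
        self.worker_alive = False
        self.pump()


# Old behavior: the exit is still queued when the mesh is torn down.
racy = ToyMesh()
racy.send_exit()
racy.shutdown()

# New behavior: the worker has already stopped before teardown begins.
fixed = ToyMesh()
fixed.stop_worker_sync()
fixed.shutdown()
```

In the first scenario the queued `Exit` lands in `racy.undeliverable`; in the second, `fixed.undeliverable` stays empty because nothing is left in flight when shutdown starts.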

ghstack-source-id: 322252639
@exported-using-ghexport

Differential Revision: [D86315780](https://our.internmc.facebook.com/intern/diff/D86315780/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D86315780/)!
samlurye added a commit that referenced this pull request Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

ghstack-source-id: 322253959
samlurye added a commit that referenced this pull request Nov 10, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

ghstack-source-id: 322255551
samlurye added a commit that referenced this pull request Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

ghstack-source-id: 322489939
samlurye added a commit that referenced this pull request Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

ghstack-source-id: 322512928
samlurye added a commit that referenced this pull request Nov 11, 2025
…geEnvelope`, and fix race in tensor engine shutdown

Pull Request resolved: #1777

ghstack-source-id: 322545046
@meta-codesync meta-codesync bot closed this in 6976258 Nov 12, 2025
meta-codesync bot commented Nov 12, 2025

This pull request has been merged in 6976258.

Labels: CLA Signed, fb-exported, Merged, meta-exported
