@dulinriley
Contributor

Summary:
When a supervision event occurs very early in a proc's lifecycle, before the
ProcMeshAgent sets the coordinator port, the event details are dropped.
Log those details so they remain visible.

Previously, the log showed only:
```
proc tcp:hostname:port,service: could not propagate supervision event: coordinator port is not set for proc tcp:hostname:port,service: crashing
```

Now the log also includes the original supervision event that caused the crash.
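
For readers unfamiliar with the failure mode, here is a minimal, language-agnostic sketch of the behavior described above, written in Python for brevity. It is not the code from this change: `Proc`, `SupervisionEvent`, `CoordinatorPortNotSet`, and the exact log wording are hypothetical stand-ins, and the only point it illustrates is that the event is logged before the proc fails when no coordinator port has been set.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("proc_sketch")


@dataclass(frozen=True)
class SupervisionEvent:
    """Hypothetical stand-in for a supervision event."""

    actor_id: str
    status: str


class CoordinatorPortNotSet(RuntimeError):
    """Raised when an event arrives before a coordinator port exists."""


class Proc:
    """Hypothetical proc that forwards supervision events to a coordinator."""

    def __init__(self, proc_id: str) -> None:
        self.proc_id = proc_id
        # Stays None until an agent registers a port (an assumption of this sketch).
        self.coordinator_port: Optional[Callable[[SupervisionEvent], None]] = None

    def handle_supervision_event(self, event: SupervisionEvent) -> None:
        if self.coordinator_port is None:
            # Log the event itself before failing so its details are not lost.
            # The message format here is illustrative, not the real log line.
            logger.error(
                "proc %s: could not propagate supervision event %r: "
                "coordinator port is not set; crashing",
                self.proc_id,
                event,
            )
            raise CoordinatorPortNotSet(self.proc_id)
        # Normal path: hand the event to the coordinator.
        self.coordinator_port(event)
```

In this sketch, an event that arrives before `coordinator_port` is assigned produces one error record containing the event's fields and then raises; once a port is assigned, the event is forwarded and nothing extra is logged.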

Differential Revision: D85676924
@meta-cla meta-cla bot added the CLA Signed label Oct 28, 2025
@meta-codesync

meta-codesync bot commented Oct 28, 2025

@dulinriley has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85676924.

@meta-codesync

meta-codesync bot commented Oct 28, 2025

This pull request has been merged in 86fba83.

AlirezaShamsoshoara pushed a commit to AlirezaShamsoshoara/monarch that referenced this pull request Oct 30, 2025
…-pytorch#1685)

Summary:
Pull Request resolved: meta-pytorch#1685

In cases where a supervision event happens very early in a proc lifecycle, before the
ProcMeshAgent sets the coordinator port, we end up dropping the event details.
Add these to logs so they are seen.

Before we would see:
```
proc tcp:hostname:port,service: could not propagate supervision event: coordinator port is not set for proc tcp:hostname:port,service: crashing
```

Now we will also see the original supervision event which caused that crash.

Reviewed By: amirafzali

Differential Revision: D85676924

fbshipit-source-id: 9c3f694be1b9dc25a51728ac313e4504482121e6
@dulinriley dulinriley deleted the export-D85676924 branch November 26, 2025 19:03
Labels

CLA Signed, fb-exported, Merged, meta-exported

2 participants