
Conversation

allenwang28
Contributor

Specifically, this PR:

  • Introduces Replica, which:
    • handles process lifecycle, async request queueing, and fault recovery
    • modularizes these concerns more than the prior implementation (a sketch of the intended shape follows this list)
  • Since Replica absorbs the functionality of RecoverableProcMesh, we remove RecoverableProcMesh altogether
  • Refactors service to use Replica
  • Updates test_service so that it is correct
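
For context, a minimal sketch of the shape Replica takes (method names here are illustrative, not the actual API):

```python
import asyncio


class Replica:
    """Owns one proc mesh: spawns it, queues requests, recovers on failure."""

    def __init__(self) -> None:
        self.proc_mesh = None  # created in initialize()
        self.request_queue: asyncio.Queue = asyncio.Queue()

    async def initialize(self) -> None:
        """Spawn the proc mesh (previously RecoverableProcMesh's job)."""
        ...

    async def run(self) -> None:
        """Drain the request queue while healthy; recover on failure."""
        ...

    async def stop(self) -> None:
        """Stop accepting requests and tear the proc mesh down."""
        ...
```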

Next PRs:

  • add endpoints to service, to retain the ability to call() and choose()
  • place replicas on their own procs
  • place services on their own procs
  • introduce sharded/distributed services? Or multiple services for envs

@allenwang28 allenwang28 requested a review from Jack-Khuu August 22, 2025 15:49
@allenwang28 allenwang28 requested a review from pbontrager August 22, 2025 15:53
Contributor

@Jack-Khuu Jack-Khuu left a comment


Still reviewing, but love it so far

Thanks for doing the refactor

@Jack-Khuu
Contributor

Logic looks legit to me. A couple of questions, but we should be ready to go

Member


nit: couldn't we just set a larger timeout and/or parameterize it? That seems cleaner than a try/except on the timeout that lets polling continue.

Contributor Author

@allenwang28 allenwang28 Aug 25, 2025


The timeout is already parameterized, and we can't set it to infinity because then we wouldn't be able to stop the replica.

We shouldn't remove this except clause, but we can set the variable to a high number; it only affects how long shutdown takes.
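
For illustration, the pattern under discussion looks roughly like this (`poll_timeout`, `request_queue`, and `_shutdown_requested` are placeholder names, not the actual code):

```python
import asyncio


async def _poll_loop(self) -> None:
    # wait_for bounds each get(), so a shutdown request is noticed within
    # at most poll_timeout seconds; an infinite timeout would block here
    # forever and the replica could never be stopped.
    while not self._shutdown_requested:
        try:
            request = await asyncio.wait_for(
                self.request_queue.get(), timeout=self.poll_timeout
            )
        except asyncio.TimeoutError:
            continue  # no work yet; re-check the shutdown flag
        await self._process(request)
```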

Member


Why is this needed?

Contributor Author


It's to avoid a busy-wait (e.g. putting the item back in the queue and checking again), which would consume 100% of a core's cycles.

But that was a bad approach; I've added a semaphore, which is cleaner.
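
A rough sketch of the semaphore approach (names and the capacity-gating interpretation are assumptions, not the exact implementation):

```python
import asyncio


class Replica:
    def __init__(self, max_concurrent: int = 8) -> None:
        # Tracks free request slots; acquire() suspends the caller until a
        # slot opens, instead of re-enqueueing and re-checking in a hot loop.
        self._slots = asyncio.Semaphore(max_concurrent)

    async def submit(self, request) -> None:
        async with self._slots:  # parked by the event loop, no busy-wait
            await self._process(request)
```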

Member


Why do we want to try to run this for RECOVERING?

Contributor Author


Good catch, I changed it so that it only runs while HEALTHY. If the replica ever becomes unhealthy, the loop stops (it can't accept any more requests) and the proc mesh is re-spawned during recovery.
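
In sketch form (state names assumed):

```python
async def run(self) -> None:
    # Pull work only while HEALTHY; a failure flips the state, the loop
    # exits (no more requests accepted), and recovery re-spawns the mesh.
    while self.state == ReplicaState.HEALTHY:
        request = await self.request_queue.get()
        try:
            await self._process(request)
        except Exception:
            self.state = ReplicaState.RECOVERING
    if self.state == ReplicaState.RECOVERING:
        await self.initialize()  # re-spawn the proc mesh, then resume
```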

Member

@joecummings joecummings left a comment


Some small things to address, but overall lgtm

Member


kewl, I didn't know asyncio had this

Member


Dumb question: how would we get here? If the proc mesh creation errors, wouldn't it go to the exception immediately? And isn't that the only way to have None for self.proc_mesh?

Contributor Author


Not a dumb question! create_proc_mesh would raise the exception, but this check covers the case where a downstream replica implementation (like Policy) doesn't do this correctly; it's another layer of guardrails. With the current implementation, we will never hit it.
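
Along these lines (a hypothetical guardrail, not the exact code):

```python
def _ensure_proc_mesh(self) -> None:
    # A downstream replica implementation (e.g. Policy) might skip proc
    # mesh creation; fail loudly rather than dereference None later.
    if self.proc_mesh is None:
        raise RuntimeError(
            "proc_mesh was never created; did initialize() run correctly?"
        )
```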

Member


Would this be unhealthy or would it just be uninitialized?

Contributor Author


Hmm, good question. Uninitialized means "the service just started and hasn't tried to init yet," whereas unhealthy means "the service tried to init but couldn't for some reason."

So in this context, unhealthy basically means "this thing is never going to initialize; you should look at your code." Right now I'm not doing anything about it; we can follow up when this inevitably becomes a problem.
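
As a hypothetical enum, the distinction looks like:

```python
from enum import Enum, auto


class ReplicaState(Enum):
    UNINITIALIZED = auto()  # just started; initialization not yet attempted
    HEALTHY = auto()        # initialized successfully and serving requests
    RECOVERING = auto()     # lost the proc mesh; re-initializing
    UNHEALTHY = auto()      # tried to initialize and failed; check your code
```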

Member


Initialize will already set this state.

Member


Won't this be handled in initialize?

Contributor Author


Yeah, this basically just adds another layer of logging saying "it failed during recovery because it failed to initialize."

Member


Can't wait for this hidden gem to trip some people up haha

Member


What's the raw error if you let this through?

Contributor Author


The service waits forever on a request that will never complete.

@allenwang28 allenwang28 merged commit 0681d7a into meta-pytorch:main Aug 25, 2025
4 checks passed
@allenwang28 allenwang28 deleted the replica branch August 25, 2025 20:15