
Conversation

allenwang28
Contributor

Specifically, this PR:

  • Introduces Replica, which:
    • handles process lifecycle, async request queueing, and fault recovery
    • modularizes these concerns more than the prior implementation (a sketch of the intended shape follows this list)
  • Since Replica absorbs the functionality of RecoverableProcMesh, we remove RecoverableProcMesh altogether
  • Refactors service to use Replica
  • Updates test_service so that it is correct
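
For context, a minimal sketch of the shape Replica takes (method names here are illustrative, not the actual API):

```python
import asyncio


class Replica:
    """Owns one proc mesh: spawns it, queues requests, recovers on failure."""

    def __init__(self) -> None:
        self.proc_mesh = None  # created in initialize()
        self.request_queue: asyncio.Queue = asyncio.Queue()

    async def initialize(self) -> None:
        """Spawn the proc mesh (previously RecoverableProcMesh's job)."""
        ...

    async def run(self) -> None:
        """Drain the request queue while healthy; recover on failure."""
        ...

    async def stop(self) -> None:
        """Stop accepting requests and tear the proc mesh down."""
        ...
```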

Next PRs:

  • add endpoints to service, to retain the ability to call() and choose()
  • place replicas on their own procs
  • place services on their own procs
  • introduce sharded/distributed services? Or multiple services for envs

@allenwang28 allenwang28 requested a review from Jack-Khuu August 22, 2025 15:49
@allenwang28 allenwang28 requested a review from pbontrager August 22, 2025 15:53
Contributor

@Jack-Khuu Jack-Khuu left a comment


Still reviewing, but love it so far

Thanks for doing the refactor

@Jack-Khuu
Contributor

Logic looks legit to me. A couple of questions, but we should be ready to go

Member


nit: couldn't we just set a larger timeout and/or parameterize it? That seems cleaner than a try/except on the timeout that lets polling continue.

Contributor Author

@allenwang28 allenwang28 Aug 25, 2025


The timeout is already parameterized, and we can't set it to infinity because then we wouldn't be able to stop the replica.

We shouldn't remove this except clause, but we can set the variable to a high number; it only affects how long shutdown takes.
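
For illustration, the pattern under discussion looks roughly like this (`poll_timeout`, `request_queue`, and `_shutdown_requested` are placeholder names, not the actual code):

```python
import asyncio


async def _poll_loop(self) -> None:
    # wait_for bounds each get(), so a shutdown request is noticed within
    # at most poll_timeout seconds; an infinite timeout would block here
    # forever and the replica could never be stopped.
    while not self._shutdown_requested:
        try:
            request = await asyncio.wait_for(
                self.request_queue.get(), timeout=self.poll_timeout
            )
        except asyncio.TimeoutError:
            continue  # no work yet; re-check the shutdown flag
        await self._process(request)
```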

Member


Why is this needed?

Contributor Author


It's to avoid a busy-wait (e.g. putting the item back in the queue and checking again), which would consume 100% of a core's cycles.

But that was a bad approach; I've added a semaphore, which is cleaner.
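
A rough sketch of the semaphore approach (names and the capacity-gating interpretation are assumptions, not the exact implementation):

```python
import asyncio


class Replica:
    def __init__(self, max_concurrent: int = 8) -> None:
        # Tracks free request slots; acquire() suspends the caller until a
        # slot opens, instead of re-enqueueing and re-checking in a hot loop.
        self._slots = asyncio.Semaphore(max_concurrent)

    async def submit(self, request) -> None:
        async with self._slots:  # parked by the event loop, no busy-wait
            await self._process(request)
```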

Member


Why do we want to try to run this for RECOVERING?

Contributor Author


Good catch, I changed it so that it only runs while HEALTHY. If the replica ever becomes unhealthy, the loop stops (it can't accept any more requests) and the proc mesh is re-spawned during recovery.
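
In sketch form (state names assumed):

```python
async def run(self) -> None:
    # Pull work only while HEALTHY; a failure flips the state, the loop
    # exits (no more requests accepted), and recovery re-spawns the mesh.
    while self.state == ReplicaState.HEALTHY:
        request = await self.request_queue.get()
        try:
            await self._process(request)
        except Exception:
            self.state = ReplicaState.RECOVERING
    if self.state == ReplicaState.RECOVERING:
        await self.initialize()  # re-spawn the proc mesh, then resume
```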

Member

@joecummings joecummings left a comment


Some small things to address, but overall lgtm

Member


kewl, I didn't know asyncio had this

Member


Dumb question: how would we get here? If the proc mesh creation errors, wouldn't it go to the exception immediately? And isn't that the only way to have None for self.proc_mesh?

Contributor Author


Not a dumb question! create_proc_mesh would raise the exception, but this check covers the case where a downstream replica implementation (like Policy) doesn't do this correctly; it's another layer of guardrails. With the current implementation, we will never hit it.
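
Along these lines (a hypothetical guardrail, not the exact code):

```python
def _ensure_proc_mesh(self) -> None:
    # A downstream replica implementation (e.g. Policy) might skip proc
    # mesh creation; fail loudly rather than dereference None later.
    if self.proc_mesh is None:
        raise RuntimeError(
            "proc_mesh was never created; did initialize() run correctly?"
        )
```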

Member


Would this be unhealthy or would it just be uninitialized?

Contributor Author


Hmm, good question. Uninitialized means "the service just started and hasn't tried to init yet," whereas unhealthy means "the service tried to init but couldn't for some reason."

So in this context, unhealthy basically means "this thing is never going to initialize; you should look at your code." Right now I'm not doing anything about it; we can follow up when this inevitably becomes a problem.
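
As a hypothetical enum, the distinction looks like:

```python
from enum import Enum, auto


class ReplicaState(Enum):
    UNINITIALIZED = auto()  # just started; initialization not yet attempted
    HEALTHY = auto()        # initialized successfully and serving requests
    RECOVERING = auto()     # lost the proc mesh; re-initializing
    UNHEALTHY = auto()      # tried to initialize and failed; check your code
```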

Member


Initialize will already set this state.

Member


Won't this be handled in initialize?

Contributor Author


Yeah, this basically just adds another layer of logging saying "it failed during recovery because it failed to initialize."

Member


Can't wait for this hidden gem to trip some people up haha

Member


What's the raw error if you let this through?

Contributor Author


The service waits forever on a request that will never complete.

@allenwang28 allenwang28 merged commit 0681d7a into meta-pytorch:main Aug 25, 2025
4 checks passed
@allenwang28 allenwang28 deleted the replica branch August 25, 2025 20:15