[Service] Refactors service to align with EX545081 #65
Conversation
Still reviewing, but love it so far
Thanks for doing the refactor
Logic looks legit to me. A couple of questions, but we should be ready to go.
src/forge/controller/replica.py
Outdated
nit: couldn't we just set a larger timeout and/or parameterize it? That seems cleaner than a try/except on the timeout that lets it keep polling.
The timeout is already parameterized, and we can't set it to infinity because then we wouldn't be able to stop the replica.
We shouldn't remove this except, but we can set the variable to a high number; it only affects how long shutdown takes.
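For reference, a minimal sketch of the polling pattern being discussed, assuming an asyncio queue, a stop event, and a parameterized poll timeout (none of these names come from the actual replica.py):

```python
import asyncio

async def handle(item) -> None:
    ...  # placeholder for the real request handler

async def run_loop(
    queue: asyncio.Queue,
    stop_event: asyncio.Event,
    poll_timeout: float = 5.0,
) -> None:
    # The timeout bounds how long a single wait can block, so the loop can
    # notice a shutdown request instead of waiting forever on an empty queue.
    while not stop_event.is_set():
        try:
            item = await asyncio.wait_for(queue.get(), timeout=poll_timeout)
        except asyncio.TimeoutError:
            # Nothing arrived in time; re-check the stop flag and keep polling.
            # A larger timeout only delays how quickly shutdown is noticed.
            continue
        await handle(item)
```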
src/forge/controller/replica.py
Outdated
Why is this needed?
It's to avoid a busy-wait (i.e. putting the item back in the queue and checking again), which would consume 100% of the cycles.
That said, it was a bad approach; I've added a semaphore, which is cleaner. A rough sketch of the idea is below.
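Roughly, the semaphore approach looks like this (the capacity value and the `process` call are placeholders, not the actual replica.py code):

```python
import asyncio

capacity = asyncio.Semaphore(8)  # assumed cap on in-flight requests

async def process(request):
    ...  # placeholder for the real downstream work

async def submit(request):
    # Instead of re-queuing an item and immediately checking again (busy-wait),
    # acquiring the semaphore suspends this task until a slot is released,
    # costing no CPU while waiting.
    async with capacity:
        return await process(request)
```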
src/forge/controller/replica.py
Outdated
Why do we want to try to run this for RECOVERING?
Good catch, I changed it so that it only runs while the replica is healthy. If it ever becomes unhealthy, the loop stops (it can't accept any more requests) and is re-spawned during recovery.
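The resulting control flow is roughly the following sketch (class, state, and attribute names are assumptions for illustration, not the real forge Replica):

```python
import asyncio
from enum import Enum, auto

class ReplicaState(Enum):  # illustrative subset of the assumed states
    HEALTHY = auto()
    RECOVERING = auto()

class Replica:  # sketch only
    def __init__(self) -> None:
        self.state = ReplicaState.HEALTHY
        self.request_queue: asyncio.Queue = asyncio.Queue()

    async def _handle(self, request) -> None:
        ...  # placeholder for real request handling

    async def _process_requests(self) -> None:
        # Only pull and handle requests while healthy; once the state changes,
        # the loop exits and recovery re-initializes the replica and starts a
        # fresh processing loop.
        while self.state == ReplicaState.HEALTHY:
            request = await self.request_queue.get()
            await self._handle(request)
```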
Some small things to address, but overall lgtm
src/forge/controller/replica.py
Outdated
kewl, I didn't know asyncio had this
src/forge/controller/replica.py
Outdated
Dumb question: how would we get here? If the proc mesh creation errors, wouldn't it go to the exception handler immediately? And isn't that the only way to end up with a None for self.proc_mesh?
Not a dumb question. create_proc_mesh would raise the exception, but this check is for the case where a downstream replica implementation (like Policy) doesn't do this correctly; it's another layer of guardrails. With the current implementation we will never hit it.
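The guardrail amounts to something like this sketch (attribute name and error message are assumptions):

```python
def _check_proc_mesh(self) -> None:
    # create_proc_mesh should already raise on failure, but a subclass that
    # forgets to assign self.proc_mesh still fails loudly here rather than
    # with an opaque error further downstream.
    if self.proc_mesh is None:
        raise RuntimeError(
            "proc_mesh is unset after initialization; "
            "check the replica subclass's setup logic"
        )
```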
src/forge/controller/replica.py
Outdated
Would this be unhealthy or would it just be uninitialized?
Hmm, good question. Uninitialized means "the service just started and I haven't tried to init yet", whereas unhealthy means "the service tried to init but couldn't for some reason".
So in this context, unhealthy basically means "this thing is never going to initialize, you should look at your code". Right now I'm not doing anything with it; we can follow up when this inevitably becomes a problem.
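A hypothetical sketch of the state distinction, not the actual enum in forge (names and members may differ):

```python
from enum import Enum, auto

class ReplicaState(Enum):
    UNINITIALIZED = auto()  # just constructed; init has not been attempted
    HEALTHY = auto()        # initialized successfully, accepting requests
    RECOVERING = auto()     # (re)initialization currently in progress
    UNHEALTHY = auto()      # init was attempted and failed
    STOPPED = auto()        # shut down intentionally
```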
src/forge/controller/replica.py
Outdated
Initialize will already set this state.
src/forge/controller/replica.py
Outdated
Won't this be handled in initialize?
Yeah, this basically just adds another layer of logs saying "it failed during recovery because it failed to initialize".
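Something along these lines, as a sketch (method and logger names are assumptions, not the actual replica.py code):

```python
import logging

logger = logging.getLogger(__name__)

class Replica:  # sketch only
    async def initialize(self) -> None:
        ...  # placeholder: real initialization lives here

    async def recover(self) -> None:
        # Recovery just delegates to initialize(); on failure it adds the
        # recovery context to the logs before re-raising.
        try:
            await self.initialize()
        except Exception:
            logger.exception("Replica failed to initialize during recovery")
            raise
```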
src/forge/controller/replica.py
Outdated
Can't wait for this hidden gem to trip some people up haha
src/forge/controller/replica.py
Outdated
What's the raw error if you let this through?
The service waits forever on a request that will never complete.
Specifically, this PR:
- RecoverableProcMesh: we remove it altogether

Next PRs: