-
Notifications
You must be signed in to change notification settings - Fork 45
Add as_actor single-actor mode
#195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Wondering what is the motivation behind this? Why would one choose an actor creation directly over service? Single actor seems to be a special case of service? |
For context: #173 You're right though, the rationale is that only vLLM should be a service currently. Trainer for e.g. will not really take advantage of fault tolerance or routing, so we should always expect it to be a singleton. |
src/forge/controller/actor.py
Outdated
| num_replicas: int | None = None, | ||
| procs: int | None = None, | ||
| **service_kwargs, | ||
| procs: int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also hosts: int, with_gpu: bool and num_replicas: int | None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of put them all in **kwargs since only procs is required for both service and actor. Do you think it is better to explicitly list them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please explicitly list them
src/forge/controller/actor.py
Outdated
| class_attrs["num_replicas"] = 1 | ||
| cfg = ServiceConfig(**filter_config_params(ServiceConfig, class_attrs)) | ||
|
|
||
| service_cls = type(f"{cls.__name__}Service", (cls,), {"_service_config": cfg}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we still need service_cls here? can the logic of as_service() be:
@classmethod
async def as_service(cls, **actor_kwargs) -> "ServiceInterface":
service = Service(cfg, cls, actor_kwargs)
await service.__initialize__()
return ServiceInterface(service, cls)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! Removed.
src/forge/controller/actor.py
Outdated
| cfg = ProcessConfig(**filter_config_params(ProcessConfig, class_attrs)) | ||
|
|
||
| logger.info("Spawning single actor %s", cls.__name__) | ||
| actor = await cls.launch(process_config=cfg, **actor_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe we can modify the def launch() above to simplify things?
Like this:
@classmethod
async def launch(cls, *args, **kwargs) -> "ForgeActor":
proc_mesh = await get_proc_mesh(process_config=ProcessConfig(procs=cls._procs, hosts=cls._hosts, with_gpu=cls._with_gpu))
actor_name = kwargs.pop("name", cls.__name__)
actor = await proc_mesh.spawn(actor_name, cls, *args, **kwargs)
actor._proc_mesh = proc_mesh
if hasattr(proc_mesh, "_hostname") and hasattr(proc_mesh, "_port"):
host, port = proc_mesh._hostname, proc_mesh._port
await actor.set_env.call(addr=host, port=port)
await actor.setup.call()
return actor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
src/forge/controller/actor.py
Outdated
| num_replicas: int | None = None, | ||
| procs: int | None = None, | ||
| **service_kwargs, | ||
| procs: int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please explicitly list them
src/forge/controller/actor.py
Outdated
| # dynamically create a configured subclass for consistency | ||
| cls = type(f"{cls.__name__}Service", (cls,), {"_service_config": cfg}) | ||
| class_attrs = {k: v for k, v in cls.__dict__.items() if not k.startswith("__")} | ||
| if "procs" not in class_attrs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow up comment on explicit attributes, this for e.g. is unclear and can be pretty brittle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in the latest version
src/forge/controller/actor.py
Outdated
| proc_mesh = await get_proc_mesh(process_config=process_config) | ||
| # Build process config from class attributes with defaults | ||
| cfg = ProcessConfig( | ||
| procs=getattr(cls, "procs", 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally try and use getattr as little as possible. If it's used too much it can mask real errors that can be really hard to debug later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fallback when the user doesn’t specify configs via .options(). In this case, the original ForgeActor class doesn’t have attributes like procs. If we are getting rid of getattr, one way I can think of is to add these attributes to ForgeActor class like
class ForgeActor(Actor):
procs: int = 1
hosts: int | None = None
with_gpus: bool = False
num_replicas: int = 1
def __init__(self, *args, **kwargs):But either way, it means the default values are specified in three places:
- In types.py
- As default values in
.options() - As attributes on the
ForgeActorclass OR here inlaunch.
I’m not sure if there’s a cleaner way to handle this. I’ve updated the code accordingly (get rid of getattr), please take a look and let me know if you have any suggestions or improvements.
src/forge/controller/actor.py
Outdated
| actor = await cls.launch(**actor_kwargs) | ||
|
|
||
| # Patch shutdown to bypass endpoint system | ||
| actor.shutdown = types.MethodType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this is a hack, we shouldn't be doing this. I'm guessing it's because we want to preserve the ability to
svc = MyActor.as_service()
await svc.shutdown()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, as_service returns a ServiceInterface. So when we call service.shutdown(), we are actually calling ServiceInterface.shutdown
The reason I have to do this hacky thing is:
Without it, actor.shutdown() gives me this error:
RuntimeError: Actor <class 'tests.unit_tests.test_service.Counter'>.shutdown is not annotated as an endpoint. To call it as one, add a @endpoint decorator to it, or directly wrap it in one as_endpoint(obj.method).call(...)If I simply decorate shutdown with @endpoint, we'd have to call it like
await actor.shutdown.call()
But it would still give error:
AssertionError("Called shutdown on a replica with no proc_mesh.")Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see. Ok in that case, I think what we should do is not do actor.shutdown() for now, and just rely on eg
await RLTrainer.stop(trainer)
for now. Maybe what we can do next is have the provisioner keep track of all of the proc meshes, and do a global shutdown()? Including all the services etc. we can discuss more, just want to unblock this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Done!
src/forge/controller/actor.py
Outdated
|
|
||
| @classmethod | ||
| async def launch(cls, *, process_config: ProcessConfig, **kwargs) -> "ForgeActor": | ||
| async def launch(cls, **kwargs) -> "ForgeActor": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add *args here? This solves the *args related TODO that's listed here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in launch and as_actor. Also tested in test_as_actor_with_kwargs_config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ty @DNXie!
src/forge/controller/actor.py
Outdated
| # Option C: skip options, use the default service config with num_replicas=1, procs=1 | ||
| service = await MyForgeActor.as_service(...) | ||
| await service.shutdown() | ||
| Returns a dynamically created subclass of this ForgeActor with bound configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Returns a dynamically created subclass of this ForgeActor with bound configuration. | |
| Returns a version of ForgeActor with configured resource attributes. |
tests/unit_tests/test_service.py
Outdated
| self.kwargs = kwargs | ||
|
|
||
| @endpoint | ||
| async def get_args(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove these tests, i think it's fine without
… service and actor (meta-pytorch#195) * add as_actor * add as_actor * refactor options to support both * rename num_hosts to hosts * rename * refactor actor.py and add more test cases * options stop taking config obj * fix lint * fix ci * fix broken tests * fix lint * remove xxService class * simplify launch * resolve comments * revert shutdown * remove shutdown patch * support args * update docstring * support args in as_service * fix ci
Summary:
This PR adds
ForgeActor.as_actor()and refactorsForgeActor.options()and.as_service()to improve configuration handling and dynamic subclassing for actors and services. Context: #173Key changes include:
as_actor()to support launching a single actor directly.ServiceandActornum_hoststohostsandnum_procstoprocsinProcessConfig..options()now stores all configuration parameters as class attributes rather than building a full config object immediately..as_actor()or.as_service()calls, ensuring each configuration remains isolated..options()is not called.Changes in behavior:
So the single actor initialization from
become
Usage Examples 1 (Actor):
Log:
Usage Examples 2 (Service):
Log when
num_replicas=3The printed class name is its original class name instead of xxService (See #193)
INFO forge.controller.actor:actor.py:123 Spawning Service Actor for Counter INFO forge.controller.actor:actor.py:207 Spawning single actor Counter INFO forge.controller.actor:actor.py:207 Spawning single actor Counter INFO forge.controller.actor:actor.py:207 Spawning single actor CounterUsage Examples 3 (Positional argument):
This means you can now do:
instead of having to use keyword-only arguments like:
Test