DNXie
Member

@DNXie DNXie commented Sep 12, 2025

This PR refactors how services are spawned for ForgeActor instances, providing a more actor-centric and user-friendly API. It is based on the discussion in #133.

Before:

policy = await spawn_service(
    ServiceConfig(num_replicas=1, procs_per_replica=2),
    Policy,
    **cfg.policy,
)
shutdown_service(policy)

After:

Option 1: pass service configs

policy = await Policy.options(num_replicas=1, procs_per_replica=2).as_service(**cfg.policy)

Option 2: pass ServiceConfig object directly

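# use a separate name so the app config `cfg` (used for cfg.policy) is not shadowed
service_cfg = ServiceConfig(num_replicas=1, procs_per_replica=2)
policy = await Policy.options(service_config=service_cfg).as_service(**cfg.policy)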

Option 3: use the default configuration

# by default uses num_replicas=1, procs_per_replica = 1
policy = await Policy.as_service(**cfg.policy)

To shut down:

policy.shutdown()

Changes introduced:

  • Added ForgeActor.options() class method, which returns a pre-configured ForgeActor subclass.
  • Updated ForgeActor.as_service() to spawn the service asynchronously and return a ServiceInterface.
  • Added ServiceInterface.shutdown() to allow shutting down the underlying service directly.
  • Updated test files tests/unit_tests/test_service.py and tests/integration_tests/test_policy_update.py.
  • Added two new unit tests in tests/unit_tests/test_service.py to verify all three construction options (explicit ServiceConfig, implicit kwargs, and defaults) work correctly.
  • Updated all service usages in code to align with the new API.
  • Removed spawn_service() and shutdown_service().
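
The three construction options described above can be sketched with stand-in classes. This is only an illustration under stated assumptions, not Forge's implementation: `ServiceConfig`, `ForgeActor`, and `Policy` below are simplified stand-ins that do not use Monarch, and `FakeService` is a hypothetical placeholder for whatever `as_service()` really returns.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    # Stand-in for Forge's ServiceConfig; only the two fields used in this PR.
    num_replicas: int = 1
    procs_per_replica: int = 1


@dataclass
class FakeService:
    # Hypothetical placeholder for the object returned by as_service().
    actor_name: str
    config: ServiceConfig
    actor_kwargs: dict


class ForgeActor:
    # Class-level default config; options() overrides it on a derived subclass.
    _service_config = ServiceConfig()

    @classmethod
    def options(cls, *, service_config=None, num_replicas=1, procs_per_replica=1):
        config = service_config or ServiceConfig(num_replicas, procs_per_replica)
        # Return a new subclass so the original class keeps its defaults.
        return type(cls.__name__, (cls,), {"_service_config": config})

    @classmethod
    async def as_service(cls, **actor_kwargs):
        # A real implementation would spawn replicas here; this only records inputs.
        return FakeService(cls.__name__, cls._service_config, actor_kwargs)


class Policy(ForgeActor):
    pass


async def demo():
    # Option 1: keyword arguments.
    by_kwargs = await Policy.options(num_replicas=1, procs_per_replica=2).as_service(model="m")
    # Option 2: an explicit ServiceConfig object.
    by_object = await Policy.options(
        service_config=ServiceConfig(num_replicas=1, procs_per_replica=2)
    ).as_service(model="m")
    # Option 3: defaults.
    by_default = await Policy.as_service(model="m")
    return by_kwargs, by_object, by_default


by_kwargs, by_object, by_default = asyncio.run(demo())
```

In this sketch, options 1 and 2 end up with the same configuration, and calling `options()` leaves the config on `Policy` itself untouched because the override lives on a freshly created subclass.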

Test:

pytest tests/unit_tests/test_service.py
pytest tests/integration_tests/test_policy_update.py
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
python -m apps.rl.main --config apps/rl/llama3_8b.yaml
python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml

@meta-cla meta-cla bot added the CLA Signed label Sep 12, 2025
@DNXie DNXie changed the title from "[WIP] Refactor service spawning: add ForgeActor.options().as_service() API" to "Refactor service spawning: add ForgeActor.options().as_service() API" Sep 15, 2025
@DNXie
Member Author

DNXie commented Sep 15, 2025

@allenwang28 I removed spawn_service and shutdown_service as we don't need them anymore. How about spawn_service_v2, shutdown_service_v2? Actually the question is: do we need ServiceActor (as well as ServiceInterfaceV2)? It is never used.

Also, the way ServiceActor initiates Service looks buggy to me.

@DNXie DNXie requested a review from allenwang28 September 15, 2025 05:09
@DNXie DNXie marked this pull request as ready for review September 15, 2025 15:52
@DNXie DNXie requested a review from joecummings September 15, 2025 15:52
Contributor

Can we remove the ConfiguredService piece altogether? options() should return a type["ForgeActor"]

Member Author

@DNXie DNXie Sep 15, 2025

The main reason we have the ConfiguredService wrapper is to allow direct access to actor endpoints via instance attributes (e.g., service.value.choose()) while still carrying the ServiceConfig on the class.

If we remove the wrapper and make options() return a subclass of ForgeActor directly, the endpoints (like counter.value) are only accessible through the underlying ServiceInterface. We would then need to change the statement

await service.value.choose()

to

await service._service_interface.value.choose()

I attempted to delegate EndpointProperty access via __getattr__ like this

def __getattr__(self, item):
    if self._service_interface is None:
        raise AttributeError(f"Service not started yet; cannot access '{item}'")
    
    attr = getattr(self._service_interface, item)
    from monarch._src.actor.endpoint import EndpointProperty
    if isinstance(attr, EndpointProperty):
        # Call the descriptor's __get__ to bind it
        return attr.__get__(self._service_interface, type(self._service_interface))

    return attr

However, it didn’t fully work: Python descriptors like EndpointProperty only bind correctly when accessed through the class or a properly initialized ServiceInterface instance. The __get__ call in __getattr__ does not fully replicate the descriptor binding behavior.

I’m still getting familiar with the internals of Monarch and some of the Python descriptor mechanics, so I'd love to hear any suggestions if there’s a cleaner way to handle this.
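
For readers less familiar with the descriptor protocol mentioned above, here is a small self-contained illustration in plain Python. It does not use Monarch: `Endpoint`, `Interface`, and `Wrapper` are hypothetical stand-ins, and the sketch does not establish how Monarch's `EndpointProperty` actually behaves. It shows that, for an ordinary descriptor, `getattr` on an instance already runs `__get__`, so the raw descriptor object is only visible through a lookup that skips binding (here `inspect.getattr_static`).

```python
import inspect


class Endpoint:
    """Toy descriptor standing in for an endpoint property (not Monarch's class)."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access yields the raw descriptor
        # Instance-level access yields a callable bound to that instance.
        return lambda *args: self.func(instance, *args)


class Interface:
    def __init__(self, name):
        self.name = name

    @Endpoint
    def greet(self, who):
        return f"{self.name} greets {who}"


class Wrapper:
    """Forwards unknown attributes to a wrapped Interface, if one exists."""

    def __init__(self, inner=None):
        self._inner = inner

    def __getattr__(self, item):
        # __getattr__ runs only when normal lookup fails, so _inner is found directly.
        if self._inner is None:
            raise AttributeError(f"Service not started yet; cannot access '{item}'")
        return getattr(self._inner, item)  # descriptor binding happens in this call


iface = Interface("svc")
bound = getattr(iface, "greet")               # already bound, not an Endpoint
raw = inspect.getattr_static(iface, "greet")  # the Endpoint object itself
via_wrapper = Wrapper(iface).greet("bob")
```

Under these assumptions, an `isinstance(attr, Endpoint)` check placed after `getattr(...)` on an instance would never be true, because the value returned is already the bound callable.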

Contributor

hmm, EndpointProperty binding is only needed after we do as_service() right?

this is how I imagine the options piece:

class ForgeActor(Actor):
    _service_config: ServiceConfig | None = None

    @classmethod
    def options(
        cls,
        *,
        service_config: ServiceConfig | None = None,
        num_replicas: int | None = None,
        procs_per_replica: int | None = None,
        **service_kwargs,
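    ) -> type["ForgeActor"]:
        # Prefer an explicit ServiceConfig; otherwise build one from the kwargs.
        if service_config:
            config = service_config
        else:
            config = ServiceConfig(num_replicas=num_replicas, procs_per_replica=procs_per_replica)
        # Create a subclass with the same name that carries the config.
        return type(
            cls.__name__,
            (cls,),
            {"_service_config": config},
        )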

Member Author

@DNXie DNXie Sep 15, 2025

> EndpointProperty binding is only needed after we do as_service() right?

Yes. It is needed whenever we call an endpoint function.

Your proposed implementation seems reasonable. However, it does not handle endpoint binding. After as_service(), you still need to go through service._service_interface to access endpoints.

Member Author

As we discussed offline, if as_service returns a ServiceInterface directly, we cannot terminate the service with

service.shutdown()

because the returned object (service) is just a ServiceInterface. In that case, we’d have to fall back to

shutdown_service(service)

Personally, I prefer the service.shutdown() style since it feels more natural and object-oriented, but I’m okay with either one.

Contributor

We can add def shutdown(self) directly to ServiceInterface
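
A minimal sketch of that suggestion, assuming the interface holds a reference to the underlying service. `_Service`, its `stop()` method, and the `_service` attribute are hypothetical names used only for illustration; the real `ServiceInterface` in Forge may differ, including whether `shutdown()` is async.

```python
import asyncio


class _Service:
    """Hypothetical stand-in for the underlying service object."""

    def __init__(self):
        self.running = True

    async def stop(self):
        # A real service would tear down replicas and processes here.
        self.running = False


class ServiceInterface:
    """Toy interface that wraps a service and exposes shutdown() on itself."""

    def __init__(self, service):
        self._service = service

    async def shutdown(self):
        # Delegate to the wrapped service so callers need no free function.
        await self._service.stop()


async def demo():
    backing = _Service()
    service = ServiceInterface(backing)
    was_running = backing.running
    await service.shutdown()
    return was_running, backing.running


was_running, is_running = asyncio.run(demo())
```

With a method like this on the interface, the object returned by `as_service()` can be shut down directly, which is what makes a separate `shutdown_service(service)` helper unnecessary.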

Member Author

Done!

DNXie and others added 3 commits September 15, 2025 12:18
@allenwang28
Contributor

> Actually the question is: do we need ServiceActor (as well as ServiceInterfaceV2)? It is never used.

yeah please leave it for now, I'm gonna clean this all up once HostMesh lands 😅

@DNXie DNXie requested a review from allenwang28 September 16, 2025 18:30
Contributor

this is fine, just curious if this is a circular dependency issue?

Member Author

Yes. There would be a circular dependency issue.

Contributor

@allenwang28 allenwang28 left a comment

beautiful, thank you! Can you message the Forge devs group about the updates? There are a few PRs in flight that will have to rebase, so please let them know what changed and how to rebase.

@DNXie DNXie merged commit 06f9296 into meta-pytorch:main Sep 16, 2025
5 checks passed
@DNXie DNXie deleted the add_options branch September 16, 2025 21:29