Member

@DNXie DNXie commented Sep 17, 2025

This PR refactors the Service routing logic into separate router classes.

Changes:

  • Moved the routing logic into independent router classes: RoundRobinRouter, LeastLoadedRouter, and SessionRouter.
  • Added a Router interface.
  • Refactored Service class accordingly.
  • Added unit tests for the added router classes and integration tests for SessionRouter and LeastLoadedRouter behavior and fallback routing.
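
For readers skimming the thread, here is a minimal sketch of how two of these routers might look behind a shared interface. It is an illustration, not the code in this PR: the Replica dataclass and its idx and load fields are hypothetical stand-ins, and the error handling for an empty replica list is an assumption.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    """Hypothetical stand-in for the real Replica; only what the sketch needs."""

    idx: int
    load: int = 0  # assumed meaning: number of in-flight requests


class Router(ABC):
    """Abstract base class for routing logic."""

    @abstractmethod
    def get_replica(
        self,
        healthy_replicas: List[Replica],
        sess_id: str | None = None,
        session_map: Dict[str, int] | None = None,
    ) -> Replica:
        """Select a replica from the list based on routing logic."""


class RoundRobinRouter(Router):
    """Cycles through the healthy replicas in order."""

    def __init__(self) -> None:
        self._next_idx = 0

    def get_replica(self, healthy_replicas, sess_id=None, session_map=None):
        if not healthy_replicas:
            raise RuntimeError("No healthy replicas available")
        # Modulo keeps the cursor valid even if the replica count changed.
        replica = healthy_replicas[self._next_idx % len(healthy_replicas)]
        self._next_idx = (self._next_idx + 1) % len(healthy_replicas)
        return replica


class LeastLoadedRouter(Router):
    """Picks the replica with the smallest current load."""

    def get_replica(self, healthy_replicas, sess_id=None, session_map=None):
        if not healthy_replicas:
            raise RuntimeError("No healthy replicas available")
        return min(healthy_replicas, key=lambda r: r.load)
```

Neither router in this sketch reads sess_id or session_map; they accept them only to satisfy the common signature.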

Test:

pytest tests/unit_tests/test_service.py

@meta-cla meta-cla bot added the CLA Signed label Sep 17, 2025
@DNXie DNXie changed the title from "[WIP] Move routing logic to classes" to "Refactor Service routing into separate router classes and make routers customizable" Sep 17, 2025
@DNXie DNXie marked this pull request as ready for review September 17, 2025 22:41
@DNXie DNXie requested a review from allenwang28 September 17, 2025 22:41
@DNXie DNXie changed the title from "Refactor Service routing into separate router classes and make routers customizable" to "Refactor Service routing into separate router classes" Sep 17, 2025
Contributor

@allenwang28 allenwang28 left a comment

great start!!!

def get_replica(
    self,
    replicas: List["Replica"],
    sess_id: str | None = None,
Contributor

sess_id: str,

i.e. don't assume None is passable for this or session_map. I would also get rid of the checks below

Member Author

@DNXie DNXie Sep 18, 2025

This is because I have an interface for routers (see also interface.py):

class Router(ABC):
    """Abstract base class for routing logic."""

    @abstractmethod
    def get_replica(
        self,
        healthy_replicas: List[Replica],
        sess_id: str | None = None,
        session_map: Dict[str, int] | None = None,
    ) -> Replica:
        """Select a replica from the list based on routing logic."""
        pass

For RoundRobinRouter and LeastLoadedRouter, sess_id and session_map can be None.
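
To make the trade-off concrete, here is a rough sketch of a session-aware router written against that same optional signature. Because the arguments may be None, it needs a runtime check that the other routers can skip. The class bodies, the ValueError, and the Replica fields (idx, load) are assumptions for illustration, not the implementation in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    """Hypothetical stand-in for the real Replica."""

    idx: int
    load: int = 0  # assumed meaning: number of in-flight requests


class LeastLoadedRouter:
    """Ignores the session arguments entirely."""

    def get_replica(self, healthy_replicas, sess_id=None, session_map=None):
        return min(healthy_replicas, key=lambda r: r.load)


class SessionRouter:
    """Sticky routing: reuse the mapped replica, otherwise ask the fallback."""

    def __init__(self, fallback_router) -> None:
        self.fallback_router = fallback_router

    def get_replica(
        self,
        healthy_replicas: List[Replica],
        sess_id: str | None = None,
        session_map: Dict[str, int] | None = None,
    ) -> Replica:
        # The shared signature allows None, so this router has to reject it.
        if sess_id is None or session_map is None:
            raise ValueError("SessionRouter requires sess_id and session_map")

        # Reuse the previously assigned replica if it is still healthy.
        assigned_idx = session_map.get(sess_id)
        if assigned_idx is not None:
            for replica in healthy_replicas:
                if replica.idx == assigned_idx:
                    return replica

        # New session, or the old replica is gone: pick one and record it.
        replica = self.fallback_router.get_replica(healthy_replicas)
        session_map[sess_id] = replica.idx
        return replica
```

With required parameters on SessionRouter, as suggested above, the None check would disappear, at the cost of its signature diverging from the base class.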

Contributor

hmm ok, I'm not sure how I feel about that longer term but that's ok for now


@pytest.mark.timeout(10)
@pytest.mark.asyncio
async def test_round_robin_router_distribution():
Contributor

for these router integration tests, are these not already covered in the above tests?

Member Author

Most of the functionality is covered, but the tests above only exercise the routers themselves. The integration tests exercise routing behavior through a service. If you're concerned about CI overhead, we could get rid of these except test_round_robin_router_distribution, which is the only test for round-robin routing.

Member Author

@DNXie DNXie Sep 18, 2025

I removed one integration test and two unit tests because of overlapping coverage, so there are now just one unit test and two integration tests:

  • test_session_router_with_round_robin_fallback: tests different fallback routers for the SessionRouter. Also tests the correctness of LeastLoadedRouter.
  • test_round_robin_router_distribution (integration): the only test case for the round-robin logic.
  • test_session_router_assigns_and_updates_session_map_in_service (integration): tests whether changes the router makes to session_map are properly reflected in the Service class.

So I think we should keep both.
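
As a rough illustration of what the two integration tests check, here is a self-contained sketch. ToyService and StickyRouter are hypothetical stand-ins (the real tests go through the actual Service and are async), and replicas are represented by plain integer indices to keep it short.

```python
from __future__ import annotations

from collections import Counter
from typing import Dict, List


class RoundRobinRouter:
    """Cycles through the replicas in order."""

    def __init__(self) -> None:
        self._next_idx = 0

    def get_replica(self, healthy_replicas, sess_id=None, session_map=None):
        replica = healthy_replicas[self._next_idx % len(healthy_replicas)]
        self._next_idx += 1
        return replica


class StickyRouter:
    """Hypothetical session router that records new sessions in the caller's map."""

    def __init__(self, fallback) -> None:
        self.fallback = fallback

    def get_replica(self, healthy_replicas, sess_id=None, session_map=None):
        replica = session_map.get(sess_id)
        if replica in healthy_replicas:
            return replica
        replica = self.fallback.get_replica(healthy_replicas)
        session_map[sess_id] = replica  # mutates the dict owned by the service
        return replica


class ToyService:
    """Hypothetical service: owns the session map and delegates to a router."""

    def __init__(self, router, num_replicas: int) -> None:
        self._router = router
        self._replicas: List[int] = list(range(num_replicas))
        self._session_map: Dict[str, int] = {}

    def route(self, sess_id: str | None = None) -> int:
        return self._router.get_replica(
            self._replicas, sess_id, self._session_map
        )


def test_round_robin_distribution():
    service = ToyService(RoundRobinRouter(), num_replicas=3)
    counts = Counter(service.route() for _ in range(9))
    # Nine requests over three replicas should land three on each.
    assert counts == {0: 3, 1: 3, 2: 3}


def test_session_map_updated_in_service():
    service = ToyService(StickyRouter(RoundRobinRouter()), num_replicas=3)
    first = service.route("a")
    # The router wrote into the dict the service passed in.
    assert service._session_map == {"a": first}
    # A second request for the same session sticks to the same replica.
    assert service.route("a") == first
```

The second test only passes because the router mutates the same dict object the service holds, which is the behavior a router-only unit test cannot observe.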

Contributor

sgtm

@DNXie DNXie requested a review from allenwang28 September 18, 2025 03:08
return replica


class LeastLoadedRouter(Router):
Member

nit: BalancedRouter?

Contributor

I think LeastConnectedRouter would be the most canonically accurate

Member Author

Thanks for the suggestion!

See Allen's discussion on this in another thread.

"I think LeastConnectedRouter would be the most canonically accurate"

I agree. So let's keep it as LeastConnectedRouter for now.

Contributor

@allenwang28 allenwang28 left a comment

Thanks @DNXie !

)


class Router(ABC):
Contributor

nit - I think this Router can actually just be in router.py

Member Author

I think we can keep it in interface.py since that's where all the interfaces are.

@DNXie DNXie merged commit 0096e72 into meta-pytorch:main Sep 19, 2025
5 checks passed
3 participants