
Proposal: RefCount allocation to host many rooms in the same pod #1197

Open
neuecc opened this issue Nov 29, 2019 · 9 comments

@neuecc neuecc commented Nov 29, 2019

Is your feature request related to a problem? Please describe.

The current Agones allocation model only supports assigning a single room to a Pod.

This provides a completely isolated model, but at the cost of the runtime and game engine:
it gives up a lot of sharable memory and pays for a lot of idle time.
If a room does not host a large number of users like a battle royale,
but instead a small number of users (2-8) with a low command frequency, we are able to host multiple rooms on one game engine (pod).


Rooms do not share state, but they do share the execution engine (immutable game data loaded in memory, runtime JIT, dynamic assemblies, etc.).
This can result in significant cost savings.

We're creating a mobile game with a .NET Core server and a Unity client.
On the server we're using MagicOnion, our open-source network engine built on .NET Core/gRPC/HTTP2.
The cost of a one-room game loop on the server is relatively small, and many rooms can coexist on a single engine.
We want to host this real-time server on Agones, but we need to host many rooms on same pod to reduce costs.

Many mobile games require a stateful, realtime server, but do not generate heavy traffic per room.
Since the runtime cost is relatively high compared to the per-room cost, we want to host multiple rooms per runtime.
(If the runtime cost were relatively low, a single room would be fine.)

Describe the solution you'd like

I suggest RefCount (virtual allocation).
An Allocate request returns the same pod and increments its reference count.
When a game finishes, a shutdown request decrements the reference count.
If the reference count would exceed the value set in the configuration,
a server backed by a new pod is returned instead.
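As a rough illustration of the proposed behavior (not an Agones API — all names and the capacity value here are hypothetical), the allocator could pack rooms onto an existing server until its reference count hits the configured limit, then fall back to a new pod:

```go
package main

import (
	"errors"
	"fmt"
)

// maxRoomsPerServer is a hypothetical configuration value: how many
// rooms a single game server pod may host before a new pod is created.
const maxRoomsPerServer = 4

// refCountAllocator sketches the proposal: Allocate returns an existing
// server while it has spare room capacity, otherwise a new one.
type refCountAllocator struct {
	servers map[string]int // server name -> active room count
	nextID  int
}

func newRefCountAllocator() *refCountAllocator {
	return &refCountAllocator{servers: map[string]int{}}
}

// Allocate returns a server name and increments its reference count.
func (a *refCountAllocator) Allocate() string {
	for name, count := range a.servers {
		if count < maxRoomsPerServer {
			a.servers[name]++
			return name
		}
	}
	// All servers full (or none exist yet): create a new one.
	a.nextID++
	name := fmt.Sprintf("gameserver-%d", a.nextID)
	a.servers[name] = 1
	return name
}

// Shutdown decrements the count; the backing pod is only really torn
// down once its last room finishes.
func (a *refCountAllocator) Shutdown(name string) error {
	count, ok := a.servers[name]
	if !ok {
		return errors.New("unknown server: " + name)
	}
	if count == 1 {
		delete(a.servers, name) // last room gone: pod can shut down
	} else {
		a.servers[name] = count - 1
	}
	return nil
}

func main() {
	a := newRefCountAllocator()
	s1 := a.Allocate()
	s2 := a.Allocate()
	fmt.Println(s1 == s2) // first two rooms pack onto one server
}
```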

@neuecc neuecc added the kind/feature label Nov 29, 2019
@roberthbailey roberthbailey commented Nov 30, 2019

I've heard a few requests for this sort of enhancement so it's definitely worth exploring and seeing how it would work.

There are a few things I'm concerned about:

  1. If we want packed allocations and we keep replacing connections into the same engine, we need a way to "drain" everyone off the game server instance (e.g. for upgrades or node maintenance). With the current design this can be easily achieved using k8s primitives with the assumption that each pod has a reasonably short lifespan (order of minutes).
  2. We need a way to keep track of capacity for each engine instance (room). We could assume each virtual allocation uses the same resources, but it would probably be better to be able to do something based on player tracking.
  3. We have a single SDK for the whole pod, so we need a backwards compatible way to introduce virtual allocations into the existing SDK.
  4. We need to think hard about the lifecycle of a virtual allocation since it will be a bit different from an allocation where we know the lifecycle of the engine is tied to the lifecycle of a container. How do healthchecks work? Is it possible to detect that one virtual allocation is unhealthy but the others in the same allocation are fine? Is it possible to rectify this situation if we can detect it?

There are probably other things that will come up, but thinking through these will be a good place to start.

@logixworx logixworx commented Dec 1, 2019

Definitely need this. I wrote a multi-room system within a single server instance for Unity/Mirror.

@roberthbailey roberthbailey commented Dec 2, 2019

@logixworx - Are there any details about that system that you can share (or point to any public documentation)? Do you have any requirements that weren't described in the original post? Do you have any insights that we can leverage in Agones? Do you have any thoughts about the questions I posed above?

@logixworx logixworx commented Dec 5, 2019

My requirements are exactly as described in the original post, as well as in your response. I can't think of anything more to contribute to the discussion at the moment.

@markmandel markmandel commented Dec 5, 2019

So I have quite a few thoughts, and they come under several categories. This is a topic that has come up quite regularly over the years.

I'm going to use the term "Game Session" for a separate session / room within a Game Server, as it's the term I've grown comfortable with.

There are some implementation details below, but they're mostly there to describe my ideas - please consider them sacrificial drafts.

Configuration

  • How do we configure how many Sessions are available per Game Server? I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (as feature creep over time is always a reality).
    • An easy option might be to have a configuration number attribute on GameServer CRDs.
  • To @roberthbailey 's point we likely want to have lifecycle states of each Session, and ability to manage them separately.
  • A solution could be to have a GameSession CRD resource that gets created for each Session for a GameServer. This would put extra load on the K8s API/etcd, but would give us maximum flexibility and visibility. (is there a max number of CRD records we can create in a K8s cluster?)
    • Each session could have Ready, Allocated states etc individually, and be queryable by all the tools available.
    • This gives us tracking across each GameServer->Session using standard K8s ownership metadata and labels.
    • I think we could reuse lots of the existing SDKs and lifecycle - e.g. at a first pass, SDK.Ready() means that all Sessions are Ready at start, and then expand as needed to SDK.SessionReady(sessionId) (as a thought).
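To make the GameSession-CRD idea above concrete, here is a sketch (in Go, mirroring how Agones defines its CRD types, though every name and field here is an assumption, not an Agones API) of a per-room resource with its own lifecycle state, owned by a GameServer:

```go
package main

import "fmt"

// Hypothetical states for a per-room GameSession resource, mirroring
// the lifecycle states a GameServer already has.
type SessionState string

const (
	SessionReady     SessionState = "Ready"
	SessionAllocated SessionState = "Allocated"
	SessionShutdown  SessionState = "Shutdown"
)

// GameSession sketches what a GameSession CRD's spec/status might
// carry; the field names are assumptions for illustration only.
type GameSession struct {
	Name           string
	GameServerName string // ownership link back to the hosting GameServer
	State          SessionState
}

// allSessionsReady demonstrates the rule suggested above: the whole
// GameServer is Ready only when every one of its Sessions is Ready.
func allSessionsReady(sessions []GameSession) bool {
	for _, s := range sessions {
		if s.State != SessionReady {
			return false
		}
	}
	return true
}

func main() {
	sessions := []GameSession{
		{Name: "session-1", GameServerName: "gs-1", State: SessionReady},
		{Name: "session-2", GameServerName: "gs-1", State: SessionAllocated},
	}
	fmt.Println(allSessionsReady(sessions)) // one session is Allocated
}
```

In a real implementation these would be Kubernetes custom resources queryable by label, which is what gives the "visibility via all the standard tools" benefit mentioned above.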

Data Communication Routing

  • The Game Server binary will need a way to get all its Sessions and their states, so it can take in traffic and route it to the appropriate internal game session process. This can probably be done by providing this information through the SDK.
  • I think there are two potential routing options (anyone have anything else?):
    1. A token is provided in the data packet for which session to send data through/back from
    2. Have separate ports for each session, and route to a separate session process based on the port.
  • The onus is on the game developer to handle this routing (i.e. not Agones), but we should probably keep this in mind, to make sure we don't block either approach
  • Open question - do we need to assign specific ports to specific sessions on a GameServer? I'm not sure how to do this. Or can we pass this responsibility up to the user?
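Routing option 1 above (a token in the data packet) would live entirely in the game developer's code. A minimal sketch, with an invented "token:payload" wire format purely for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// router demultiplexes incoming packets to in-process rooms by a
// session token carried in each packet (routing option 1 above).
// The "token:payload" format is illustrative, not a real protocol.
type router struct {
	rooms map[string]chan string // session token -> room inbox
}

func (r *router) route(packet string) error {
	token, payload, found := strings.Cut(packet, ":")
	if !found {
		return errors.New("malformed packet")
	}
	inbox, ok := r.rooms[token]
	if !ok {
		return errors.New("unknown session token: " + token)
	}
	inbox <- payload
	return nil
}

func main() {
	inbox := make(chan string, 1)
	r := &router{rooms: map[string]chan string{"room-a": inbox}}
	if err := r.route("room-a:move 3 4"); err != nil {
		panic(err)
	}
	fmt.Println(<-inbox) // prints "move 3 4"
}
```

Option 2 (one port per session) avoids the in-band token but consumes the port range faster, which is why the open question about port assignment matters.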

Allocation

  • Question: Should a Session allocation happen on a different API path than GameServer Allocation?
  • My current thinking is "no".
  • As we expand out, we're already also talking about being able to allocate based on player capacity (#1033) down the line as well. I can only assume we'll have more allocation requirements as we continue to grow.
  • So we should work out if there is a nice way we can extend the current allocation resource API to include this feature, and potentially more down the line. Let's not box ourselves into a corner with an implementation that works specifically for this, but not for others (another point against a refcount).
  • I expect there will be good code reuse here also keeping allocation down a single code path.
  • The idea would be for Allocation, when a Session is requested, to pack sessions appropriately, allocate backing GameServers when needed, mark GameSessions as allocated, and return that data when done.
  • To @roberthbailey 's point as well - I feel like we will need a "Drain" (or similar) state for a GameServer that basically means "stop allocating to this GameServer, and shut it down once it's empty"
    • This would also be useful for player capacity based allocation (although we should be smart about what "empty" means, and how it is defined)

Other Thoughts

  • Originally, I had thought this might be better as a separate framework that sat on top of Agones. The more I think about it, I think this is non-optimal, as this functionality is better weaved into Agones such that it is opt-in as needed. There is a lot of Agones core we can reuse here, which we would have to rebuild otherwise, and the value isn't there.

Would love comments, thoughts and questions on the above.

@theminecoder theminecoder commented Dec 7, 2019

How do we configure how many Sessions are available per Game Server. I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (as feature creep over time is always a reality)

From an API standpoint it would be nice to just be able to call something like RequestSession() as often as I need. I have a system that allows us to fully clean the server between sessions, so being able to run the server for as long as possible is better than having to wait for a new container to boot.

We could still have a config option equating to the max total sessions the instance is allowed to create, which would make the sidecar return an error when a session is requested after the limit has been reached. This could fit into draining/updating as well, by returning the error once the instance has been told to drain/update.
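The suggested behavior could look something like the following sketch (the `sessionGate` type and its fields are hypothetical; the point is just that the sidecar refuses new sessions past a configured max or once draining has started):

```go
package main

import (
	"errors"
	"fmt"
)

// sessionGate sketches the suggestion above: the sidecar tracks how
// many sessions an instance has created, and refuses new ones past a
// configured max total, or once the instance is told to drain.
type sessionGate struct {
	maxTotalSessions int
	created          int
	draining         bool
}

var errNoMoreSessions = errors.New("session limit reached or draining")

// RequestSession hands out a new session id, or an error when the
// instance should stop taking sessions.
func (g *sessionGate) RequestSession() (string, error) {
	if g.draining || g.created >= g.maxTotalSessions {
		return "", errNoMoreSessions
	}
	g.created++
	return fmt.Sprintf("session-%d", g.created), nil
}

func main() {
	g := &sessionGate{maxTotalSessions: 2}
	g.RequestSession()
	g.RequestSession()
	_, err := g.RequestSession()
	fmt.Println(err != nil) // third request exceeds the limit
}
```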

@markmandel markmandel commented Dec 9, 2019

From an api standpoint it would be nice to just be able to call something like RequestSession() as much as I need.

What is "it" in the above? What exactly would "RequestSession()" be doing here? Is this an SDK-level API, or is it more something like Allocation?

One thing I didn't mention in the above, was I think we would also need a way to self-allocate a GameSession through the SDK - much like we do for a GameServer - so something like SDK.SessionAllocate(sessionId), to support certain workflows - is that in the vein of what you are thinking?

Then you can move the GameSession back with SDK.SessionReady(sessionId) when you want it to go back to being in the pool of available Allocatable GameSessions.

Actually SDK.SessionReady(sessionId) to return a session to a Ready state would be useful if it was allocated through the Allocation endpoint as well. (and if all sessions are Ready, the GameServer should return to Ready too, so it could be scaled down if needed).

I think we are on the same page that the assumption is that in this instance, a GameServer will likely be up and running for longer than a singular game Session 👍
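The per-session lifecycle being discussed (hypothetical SessionAllocate/SessionReady SDK calls, with the whole GameServer dropping back to Ready once every session is Ready again) could be sketched like this; none of these method names exist in the Agones SDK today:

```go
package main

import "fmt"

// mockSDK sketches the hypothetical per-session SDK surface discussed
// above; it is not the real Agones SDK.
type mockSDK struct {
	allocated map[string]bool // session id -> currently allocated?
}

// SessionAllocate self-allocates a session, like SDK.Allocate() does
// for a whole GameServer today.
func (s *mockSDK) SessionAllocate(id string) { s.allocated[id] = true }

// SessionReady returns a session to the pool of allocatable sessions.
func (s *mockSDK) SessionReady(id string) { s.allocated[id] = false }

// GameServerReady reports whether the whole GameServer could return to
// the Ready pool (i.e. no session is currently allocated), so it could
// be scaled down if needed.
func (s *mockSDK) GameServerReady() bool {
	for _, a := range s.allocated {
		if a {
			return false
		}
	}
	return true
}

func main() {
	sdk := &mockSDK{allocated: map[string]bool{}}
	sdk.SessionAllocate("s1")
	fmt.Println(sdk.GameServerReady()) // false while s1 is allocated
	sdk.SessionReady("s1")
	fmt.Println(sdk.GameServerReady()) // true once all sessions are Ready
}
```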

@logixworx logixworx commented Jan 10, 2020

Is there an ETA on this feature? I need it asap. Thanks!

@markmandel markmandel commented Jan 10, 2020

@logixworx no ETA as of yet. We've got a variety of users who are interested in this as well, so I would expect it to come at some point this year.

We still need to do a complete design document on this, as well as implementation - it's a pretty substantial piece of work, so I expect it will take some months to complete.
