
FleetAutoScaler Buffer policy does not notice Unhealthy/Stopping gameservers #423

Closed
Oleksii-Terekhov opened this issue Nov 20, 2018 · 12 comments
Labels
question I have a question!

Comments

@Oleksii-Terekhov

Bug: When a GameServer transitions from the Allocated state into a stopping state, the FleetAutoscaler does not start a new GameServer until the old pod has disappeared.

A GameServer may currently be in one of these status groups:

  • before allocation (Creating, Starting, Scheduled, RequestReady, Ready)
  • allocated (Allocated)
  • stopping (Shutdown, Error, Unhealthy)

My setup: a GameServer with an additional sidecar container. After a session ends, the GameServer sits in the Unhealthy state for about 30-60 seconds while the sidecar stops. Sometimes during this window I cannot play at all: every one of the MinReplicas+BufferSize servers is unhealthy (but as far as the FleetAutoscaler is concerned, they are still Ready or Allocated :(( ).

Proposal: do not count GameServers in the Shutdown, Error, or Unhealthy states in applyBufferPolicy(), as long as we have not reached MaxReplicas.

Code snippet from pkg/fleetautoscalers/fleetautoscalers.go:

```go
...
func applyBufferPolicy(b *stablev1alpha1.BufferPolicy, f *stablev1alpha1.Fleet) (int32, bool, error) {
...
		replicas = f.Status.AllocatedReplicas + int32(b.BufferSize.IntValue())
...
		replicas = int32(math.Ceil(float64(f.Status.AllocatedReplicas*100) / float64(100-bufferPercent)))
...
```
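To make the proposal concrete, here is a rough sketch (not the actual Agones code) of how the buffer policy might compensate for servers that are still shutting down. Fleet.Status does not expose such a count, so the `draining` parameter below is hypothetical, and only the integer BufferSize case is shown:

```go
// Illustrative sketch only, not the Agones implementation. It assumes an
// integer BufferSize and takes the number of draining GameServers
// (Shutdown/Error/Unhealthy) as an extra, hypothetical input, since
// Fleet.Status does not expose such a count.
func applyBufferPolicyWithDraining(b *stablev1alpha1.BufferPolicy, f *stablev1alpha1.Fleet, draining int32) (int32, bool, error) {
	// Base target: keep BufferSize Ready servers on top of the Allocated ones.
	replicas := f.Status.AllocatedReplicas + int32(b.BufferSize.IntValue())

	// Compensate for servers that still occupy replica slots while they
	// shut down, so the effective buffer is not eaten by draining pods.
	replicas += draining

	// Clamp to the configured bounds, as the real policy does.
	limited := false
	if replicas < b.MinReplicas {
		replicas = b.MinReplicas
		limited = true
	}
	if replicas > b.MaxReplicas {
		replicas = b.MaxReplicas
		limited = true
	}
	return replicas, limited, nil
}
```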
@victor-prodan
Contributor

The current behavior is done on purpose, to prevent flooding the cluster with servers.

Your buffer size must account for init and shutdown times.
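For a rough sense of the sizing (illustrative numbers, not from this thread): if sessions end at about two per minute and each GameServer spends roughly 60 seconds between going Unhealthy and its replacement becoming Ready, then about two replica slots are unusable at any given moment, so the buffer needs to be at least two servers larger than the demand spike you want to absorb.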

@markmandel
Member

Is this something we could document better? Would love suggestions if there are particular places we can do that.

@Oleksii-Terekhov
Author

Oleksii-Terekhov commented Nov 21, 2018

Hmm... maybe a more aggressive FleetAutoscaler strategy as a configurable parameter?
How about precise metrics for tuning the buffer size:

  • Prometheus metrics like gameserver{clustername="", namespace="", fleet_name="", status=""} int_value (see the example below)
  • a REST API on the controller, e.g. GET /namespace/<fleet or fas name>/status with a JSON result (the count of GameServers in each status)
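For illustration, such a per-status gauge might look like this in the Prometheus exposition format (hypothetical metric name and made-up label values; this was not an existing Agones metric at the time):

```
# HELP gameserver_count Number of game servers per fleet and status (hypothetical metric)
# TYPE gameserver_count gauge
gameserver_count{clustername="dev",namespace="default",fleet_name="my-fleet",status="Ready"} 2
gameserver_count{clustername="dev",namespace="default",fleet_name="my-fleet",status="Unhealthy"} 3
```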

@markmandel
Member

A couple of interesting things from this:

  1. The sidecar should shut down as soon as it gets a terminate signal. If it isn't doing that, that's a bug.
  2. Does the webhook design proposed in Horizontal Fleet Autoscaling #334 (comment) solve your issue?

@Oleksii-Terekhov
Author

  1. My sidecar gets, processes, and sends some post-round metrics before exiting - it shuts down ASAP, but that non-zero processing time is what causes my issue with the FleetAutoscaler's non-aggressive spawning.
  2. The webhook looks good... but the external resolver will need the current fleet state on each request - it needs an easy way to receive a fleet summary.

@markmandel
Member

What do you mean by "fleet summary" explicitly? We are passing through the fleet status, which includes counts.

That all being said, the webhook implementation can always access the Kubernetes API for any extra information it needs.
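As a rough illustration of what a webhook endpoint along the lines of the #334 proposal might look like (the request/response shapes below are assumptions for illustration, not the final Agones webhook API), a minimal handler could compute the draining count from the fleet status it is passed:

```go
// Minimal sketch of a fleet autoscaling webhook, assuming a JSON payload
// that carries the fleet status counts. Shapes and field names here are
// hypothetical, not the Agones API.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Assumed payload: the fleet status counts the controller would pass through.
type fleetStatus struct {
	Replicas          int32 `json:"replicas"`
	ReadyReplicas     int32 `json:"readyReplicas"`
	AllocatedReplicas int32 `json:"allocatedReplicas"`
}

type autoscaleRequest struct {
	Status fleetStatus `json:"status"`
}

type autoscaleResponse struct {
	Scale    bool  `json:"scale"`
	Replicas int32 `json:"replicas"`
}

func handleAutoscale(w http.ResponseWriter, r *http.Request) {
	var req autoscaleRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	const bufferSize = 2 // desired number of Ready servers on top of Allocated

	// Servers that are neither Ready nor Allocated (e.g. Unhealthy/Shutdown)
	// still count against total Replicas, so add them on top of the buffer.
	draining := req.Status.Replicas - req.Status.ReadyReplicas - req.Status.AllocatedReplicas
	target := req.Status.AllocatedReplicas + bufferSize + draining

	resp := autoscaleResponse{Scale: target != req.Status.Replicas, Replicas: target}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/autoscale", handleAutoscale)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```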

@Oleksii-Terekhov
Author

Oleksii-Terekhov commented Nov 23, 2018

The current state

```yaml
Status:
  Allocated Replicas:  0
  Ready Replicas:      2
  Replicas:            2
```

does not reflect my core problem - GameServers in the Shutdown/Unhealthy states :)
I'm afraid that if the webhook function queries every GameServer on each call from the FleetAutoscaler, it will flood and kill the Kubernetes API :(

@markmandel
Member

I take it subtracting allocated and ready from total replicas doesn't give you what you want? We could always add more values to the status totals as we find them necessary.
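For example (hypothetical numbers): with the status above, 2 replicas - 2 ready - 0 allocated = 0 servers in other states; if three servers were Unhealthy and not yet removed, the status might instead show Replicas: 5, Ready Replicas: 2, Allocated Replicas: 0, and 5 - 2 - 0 = 3 would be the draining count the buffer could compensate for.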

Also the webhook will only fire every 30s, just like the current buffer autoscaler - for exactly the same reason as @victor-prodan described.

@Oleksii-Terekhov
Author

OK, I will tune the buffer size. The current fleet "split brain" situation (max/min/buffer in the FleetAutoscaler, the current count in the Fleet) makes this process... complicated.

Do you have any plans for Prometheus metrics or an internal Kubernetes /metrics endpoint?

@cyriltovena
Collaborator

Hey @Oleksii-Terekhov ,

I'm currently working on metrics, and the first exporter option will be Prometheus.
Do you have any specific metrics in mind that you would like to see?

I have currently implemented:

  • fleet replica counts, per fleet and per type
  • game server counts, per status
  • Go runtime metrics for the controller
  • process metrics for the controller
  • health check metrics for each controller

Let me know!

@Oleksii-Terekhov
Author

With multi-cluster in mind:
gameserver{clustername="", namespace="", fleet_name="", status=""} int_value

And maybe some info about the FleetAutoscaler:
fas_max{clustername="", namespace="", fleet_name=""} int_value
fas_min{clustername="", namespace="", fleet_name=""} int_value
fas_buffer{clustername="", namespace="", fleet_name=""} int_value

@cyriltovena
Collaborator

I think Prometheus will add the namespace automatically.

I will add fleet_name to the game server count, good idea. The FleetAutoscaler metrics seem doable; I'll make sure they're in the first draft.

Thanks!

@markmandel markmandel added the question I have a question! label Nov 26, 2018