
FleetAutoScaler Buffer policy does not notice Unhealthy/Stopping gameservers #423

Closed
Oleksii-Terekhov opened this issue Nov 20, 2018 · 12 comments
Labels
question I have a question!

Comments

@Oleksii-Terekhov

Bug: When a GameServer transitions from the Allocated state into a stopping state, the FleetAutoscaler does not start a new GameServer until the old pod has disappeared.

A GameServer may currently be in one of these status groups:

  • before allocation (Creating, Starting, Scheduled, RequestReady, Ready)
  • allocated (Allocated)
  • stopping (Shutdown, Error, Unhealthy)

My setup: a GameServer with an additional sidecar container. After a session ends, the GameServer sits in the Unhealthy state for about 30-60 seconds while the sidecar stops. Sometimes during this window I cannot play at all: every one of the MinReplicas+BufferSize servers is unhealthy (but as far as the FleetAutoscaler is concerned, they are still Ready or Allocated :(( ).

Proposal: do not count GameServers in the Shutdown, Error, or Unhealthy states in applyBufferPolicy(), as long as we have not reached MaxReplicas.

Code snippet from pkg/fleetautoscalers/fleetautoscalers.go:

```go
...
func applyBufferPolicy(b *stablev1alpha1.BufferPolicy, f *stablev1alpha1.Fleet) (int32, bool, error) {
...
		replicas = f.Status.AllocatedReplicas + int32(b.BufferSize.IntValue())
...
		replicas = int32(math.Ceil(float64(f.Status.AllocatedReplicas*100) / float64(100-bufferPercent)))
...
```
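To make the proposal concrete, here is a rough sketch (not the actual Agones code) of how the buffer policy might compensate for servers that are still shutting down. Fleet.Status does not expose such a count, so the `draining` parameter below is hypothetical, and only the integer BufferSize case is shown:

```go
// Illustrative sketch only, not the Agones implementation. It assumes an
// integer BufferSize and takes the number of draining GameServers
// (Shutdown/Error/Unhealthy) as an extra, hypothetical input, since
// Fleet.Status does not expose such a count.
func applyBufferPolicyWithDraining(b *stablev1alpha1.BufferPolicy, f *stablev1alpha1.Fleet, draining int32) (int32, bool, error) {
	// Base target: keep BufferSize Ready servers on top of the Allocated ones.
	replicas := f.Status.AllocatedReplicas + int32(b.BufferSize.IntValue())

	// Compensate for servers that still occupy replica slots while they
	// shut down, so the effective buffer is not eaten by draining pods.
	replicas += draining

	// Clamp to the configured bounds, as the real policy does.
	limited := false
	if replicas < b.MinReplicas {
		replicas = b.MinReplicas
		limited = true
	}
	if replicas > b.MaxReplicas {
		replicas = b.MaxReplicas
		limited = true
	}
	return replicas, limited, nil
}
```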
@victor-prodan
Contributor

The current behavior is done on purpose, to prevent flooding the cluster with servers.

Your buffer size must account for init and shutdown times.
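For a rough sense of the sizing (illustrative numbers, not from this thread): if sessions end at about two per minute and each GameServer spends roughly 60 seconds between going Unhealthy and its replacement becoming Ready, then about two replica slots are unusable at any given moment, so the buffer needs to be at least two servers larger than the demand spike you want to absorb.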

@markmandel
Member

Is this something we could document better? Would love suggestions if there are particular places we can do that.

@Oleksii-Terekhov
Author

Oleksii-Terekhov commented Nov 21, 2018

Hmm... maybe a more aggressive FleetAutoscaler strategy as a configurable parameter?
How about precise metrics for tuning the buffer size:

  • Prometheus metrics like gameserver{clustername="", namespace="", fleet_name="", status=""} int_value (see the example below)
  • a REST API on the controller, e.g. GET /namespace/<fleet or fas name>/status with a JSON result (the count of GameServers in each status)
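For illustration, such a per-status gauge might look like this in the Prometheus exposition format (hypothetical metric name and made-up label values; this was not an existing Agones metric at the time):

```
# HELP gameserver_count Number of game servers per fleet and status (hypothetical metric)
# TYPE gameserver_count gauge
gameserver_count{clustername="dev",namespace="default",fleet_name="my-fleet",status="Ready"} 2
gameserver_count{clustername="dev",namespace="default",fleet_name="my-fleet",status="Unhealthy"} 3
```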

@markmandel
Member

A couple of interesting things from this:

  1. The sidecar should shut down as soon as it gets a terminate signal. If it isn't doing that, that's a bug.
  2. Does the webhook design proposed in Horizontal Fleet Autoscaling #334 (comment) solve your issue?

@Oleksii-Terekhov
Author

  1. My sidecar gets, processes, and sends some post-round metrics before exiting - it shuts down ASAP, but that non-zero processing time is what causes my issue with the FleetAutoscaler's non-aggressive spawning.
  2. The webhook looks good... but the external resolver will need the current fleet state on each request - it needs an easy way to receive a fleet summary.

@markmandel
Member

What do you mean by "fleet summary" explicitly? We are passing through the fleet status, which includes counts.

That all being said, the webhook implementation can always access the Kubernetes API for any extra information it needs.
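As a rough illustration of what a webhook endpoint along the lines of the #334 proposal might look like (the request/response shapes below are assumptions for illustration, not the final Agones webhook API), a minimal handler could compute the draining count from the fleet status it is passed:

```go
// Minimal sketch of a fleet autoscaling webhook, assuming a JSON payload
// that carries the fleet status counts. Shapes and field names here are
// hypothetical, not the Agones API.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Assumed payload: the fleet status counts the controller would pass through.
type fleetStatus struct {
	Replicas          int32 `json:"replicas"`
	ReadyReplicas     int32 `json:"readyReplicas"`
	AllocatedReplicas int32 `json:"allocatedReplicas"`
}

type autoscaleRequest struct {
	Status fleetStatus `json:"status"`
}

type autoscaleResponse struct {
	Scale    bool  `json:"scale"`
	Replicas int32 `json:"replicas"`
}

func handleAutoscale(w http.ResponseWriter, r *http.Request) {
	var req autoscaleRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	const bufferSize = 2 // desired number of Ready servers on top of Allocated

	// Servers that are neither Ready nor Allocated (e.g. Unhealthy/Shutdown)
	// still count against total Replicas, so add them on top of the buffer.
	draining := req.Status.Replicas - req.Status.ReadyReplicas - req.Status.AllocatedReplicas
	target := req.Status.AllocatedReplicas + bufferSize + draining

	resp := autoscaleResponse{Scale: target != req.Status.Replicas, Replicas: target}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/autoscale", handleAutoscale)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```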

@Oleksii-Terekhov
Author

Oleksii-Terekhov commented Nov 23, 2018

The current state

```yaml
Status:
  Allocated Replicas:  0
  Ready Replicas:      2
  Replicas:            2
```

does not reflect my core problem - GameServers in the Shutdown/Unhealthy states :)
I'm afraid that if the webhook function queries every GameServer on each call from the FleetAutoscaler, it will flood and kill the Kubernetes API :(

@markmandel
Member

I take it subtracting allocated and ready from total replicas doesn't give you what you want? We could always add more values to the status totals as we find them necessary.
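For example (hypothetical numbers): with the status above, 2 replicas - 2 ready - 0 allocated = 0 servers in other states; if three servers were Unhealthy and not yet removed, the status might instead show Replicas: 5, Ready Replicas: 2, Allocated Replicas: 0, and 5 - 2 - 0 = 3 would be the draining count the buffer could compensate for.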

Also the webhook will only fire every 30s, just like the current buffer autoscaler - for exactly the same reason as @victor-prodan described.

@Oleksii-Terekhov
Author

OK, I will tune the buffer size. The current fleet "split brain" situation (max/min/buffer in the FleetAutoscaler, the current count in the Fleet) makes this process... complicated.

Do you have any plans for Prometheus metrics or an internal Kubernetes /metrics endpoint?

@cyriltovena
Collaborator

Hey @Oleksii-Terekhov ,

I'm currently working on metrics, and the first exporter option will be Prometheus.
Do you have any specific metrics in mind that you would like to see?

I have currently implemented:

  • fleet replica counts, per fleet and per type
  • game server counts, per status
  • Go runtime metrics for the controller
  • process metrics for the controller
  • health check metrics for each controller

Let me know!

@Oleksii-Terekhov
Author

With multi-cluster in mind:
gameserver{clustername="", namespace="", fleet_name="", status=""} int_value

And maybe some info about the FleetAutoscaler:
fas_max{clustername="", namespace="", fleet_name=""} int_value
fas_min{clustername="", namespace="", fleet_name=""} int_value
fas_buffer{clustername="", namespace="", fleet_name=""} int_value

@cyriltovena
Collaborator

I think Prometheus will add the namespace automatically.

I will add fleet_name to the game server count, good idea. The FleetAutoscaler metrics seem doable; I'll make sure they're in the first draft.

Thanks!

@markmandel markmandel added the question I have a question! label Nov 26, 2018