
Memory-based autoscaling for the Broker deployments may not be reachable in certain circumstances #1520

Closed
Ectelion opened this issue Jul 27, 2020 · 3 comments
Labels
area/broker, area/performance, kind/bug, priority/1, release/2

Comments

@Ectelion
Contributor

For the Broker deployments (Fanout, Retry, Ingress) there is a considerable gap between the requested memory and the memory limit, e.g. "500Mi" and "3000Mi" respectively for Fanout. At the same time, the HPA is initialized with a memory-based autoscaling target, and that threshold is currently set to half of the memory limit, e.g. "1500Mi" in this case. Note that "1500Mi" is 3 times the initially requested memory. That is, under increased load, a pod would first need to grow its memory usage to roughly three times its request (scale up) before the deployment scales out. When a pod of the deployment is scheduled onto a node, the requested memory is what it is guaranteed to get; however, there seems to be no guarantee that the pod will be allowed to consume resources all the way up to the limit at any arbitrary point. For instance, imagine that the pod was scheduled onto a node that only has "800Mi" of memory left.

As a result, the following might be possible:

  1. Under certain circumstances, the deployment won’t be able to hit the HPA threshold and thus won’t scale horizontally when it needs to;
  2. Even if there are enough resources, under certain load patterns the memory scale-up may not catch up with the increased load fast enough, and the deployment may run into perf/memory issues before the HPA has a chance to kick in.
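
For concreteness, here is a minimal sketch of the numbers quoted above for Fanout (500Mi requested, 3000Mi limit, HPA memory target at half the limit). The values and the arithmetic are illustrative only; this is not the actual reconciler code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Assumed values matching the Fanout example above.
	resources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("500Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("3000Mi"),
		},
	}

	// Current behavior as described above: the HPA memory target is half of the limit.
	limit := resources.Limits[corev1.ResourceMemory]
	request := resources.Requests[corev1.ResourceMemory]
	target := resource.NewQuantity(limit.Value()/2, resource.BinarySI)

	// Prints: request=500Mi limit=3000Mi hpaTarget=1500Mi (3.0x the request)
	fmt.Printf("request=%s limit=%s hpaTarget=%s (%.1fx the request)\n",
		request.String(), limit.String(), target.String(),
		float64(target.Value())/float64(request.Value()))
}
```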

Action items

  • It might be worth accounting for this during perf/tuning experiments;
  • One potential fix is to set the HPA’s target memory consumption threshold below the requested memory (as opposed to the limit); see the sketch below.
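
A minimal sketch of the proposed fix, assuming the HPA is defined with the autoscaling/v2beta2 API: the memory target is expressed as an absolute AverageValue placed below the request ("400Mi" here) instead of at half the limit. The object names, replica bounds, and quantities are assumptions for illustration, not the actual BrokerCell reconciler code:

```go
package main

import (
	"fmt"

	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fanoutHPA builds a HorizontalPodAutoscaler whose memory target sits below
// the deployment's 500Mi request, so scale-out can trigger before a pod needs
// memory the node may not be able to give it.
func fanoutHPA() *autoscalingv2beta2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	memTarget := resource.MustParse("400Mi") // below the 500Mi request (assumed value)

	return &autoscalingv2beta2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "fanout-hpa"}, // name assumed for illustration
		Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "fanout", // deployment name assumed for illustration
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2beta2.MetricSpec{{
				Type: autoscalingv2beta2.ResourceMetricSourceType,
				Resource: &autoscalingv2beta2.ResourceMetricSource{
					Name: corev1.ResourceMemory,
					Target: autoscalingv2beta2.MetricTarget{
						Type:         autoscalingv2beta2.AverageValueMetricType,
						AverageValue: &memTarget,
					},
				},
			}},
		},
	}
}

func main() {
	hpa := fanoutHPA()
	fmt.Println(hpa.Spec.Metrics[0].Resource.Target.AverageValue.String()) // 400Mi
}
```

An alternative under the same API would be a Utilization target, which the HPA interprets as a percentage of the requested memory (e.g. 80% of "500Mi"); either way the threshold stays within what the pod is guaranteed to be able to allocate.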

Additional context
It seems that occasional OOM issues were previously observed on the Broker.

@grantr grantr added this to Backlog in GCP Broker via automation Jul 28, 2020
@grantr grantr added the priority/1 and release/2 labels Jul 28, 2020
@grantr grantr added this to the Backlog milestone Jul 28, 2020
@liu-cong
Contributor

liu-cong commented Aug 4, 2020

Need to figure out #1545 first

@Ectelion
Contributor Author

Ectelion commented Oct 6, 2020

This is now fixed for Ingress and Fanout as part of #1600 and #1601. Retry configs are upcoming (#1602) and will most likely account for this as well. We can close this after #1602.

@Ectelion
Contributor Author

This is now addressed for all the BrokerCell deployments.

GCP Broker automation moved this from Backlog to Done Oct 19, 2020