
Memory-based autoscaling for the Broker deployments may not be reachable in certain circumstances #1520

Closed
Ectelion opened this issue Jul 27, 2020 · 3 comments
Labels
area/broker, area/performance, kind/bug, priority/1, release/2

Comments

@Ectelion
Contributor

For the Broker deployments (Fanout, Retry, Ingress) there is a considerable gap between the requested memory and the memory limit, e.g. "500Mi" and "3000Mi" respectively for Fanout. At the same time, the HPA is initialized with a memory-based autoscaling target, and that threshold is currently set to half of the memory limit, e.g. "1500Mi" in this case. Note that "1500Mi" is 3 times the initially requested memory. That is, under increased load, a pod would first need to grow its memory usage to roughly three times its request (scale up) before the deployment scales out. When a pod of the deployment is scheduled onto a node, the requested memory is what it is guaranteed to get; however, there seems to be no guarantee that the pod will be allowed to consume resources all the way up to the limit at any arbitrary point. For instance, imagine that the pod was scheduled onto a node that only has "800Mi" of memory left.

As a result, the following might be possible:

  1. Under certain circumstances, the deployment won’t be able to hit the HPA threshold and thus won’t scale horizontally when it needs to;
  2. Even if there are enough resources, under certain load patterns the memory scale-up may not catch up with the increased load fast enough, and the deployment may run into perf/memory issues before the HPA has a chance to kick in.
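
For concreteness, here is a minimal sketch of the numbers quoted above for Fanout (500Mi requested, 3000Mi limit, HPA memory target at half the limit). The values and the arithmetic are illustrative only; this is not the actual reconciler code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Assumed values matching the Fanout example above.
	resources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("500Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("3000Mi"),
		},
	}

	// Current behavior as described above: the HPA memory target is half of the limit.
	limit := resources.Limits[corev1.ResourceMemory]
	request := resources.Requests[corev1.ResourceMemory]
	target := resource.NewQuantity(limit.Value()/2, resource.BinarySI)

	// Prints: request=500Mi limit=3000Mi hpaTarget=1500Mi (3.0x the request)
	fmt.Printf("request=%s limit=%s hpaTarget=%s (%.1fx the request)\n",
		request.String(), limit.String(), target.String(),
		float64(target.Value())/float64(request.Value()))
}
```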

Action items

  • It might be worth accounting for this during perf/tuning experiments;
  • One potential fix is to set the HPA’s target memory consumption threshold below the requested memory (as opposed to the limit); see the sketch below.
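
A minimal sketch of the proposed fix, assuming the HPA is defined with the autoscaling/v2beta2 API: the memory target is expressed as an absolute AverageValue placed below the request ("400Mi" here) instead of at half the limit. The object names, replica bounds, and quantities are assumptions for illustration, not the actual BrokerCell reconciler code:

```go
package main

import (
	"fmt"

	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fanoutHPA builds a HorizontalPodAutoscaler whose memory target sits below
// the deployment's 500Mi request, so scale-out can trigger before a pod needs
// memory the node may not be able to give it.
func fanoutHPA() *autoscalingv2beta2.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	memTarget := resource.MustParse("400Mi") // below the 500Mi request (assumed value)

	return &autoscalingv2beta2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "fanout-hpa"}, // name assumed for illustration
		Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "fanout", // deployment name assumed for illustration
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2beta2.MetricSpec{{
				Type: autoscalingv2beta2.ResourceMetricSourceType,
				Resource: &autoscalingv2beta2.ResourceMetricSource{
					Name: corev1.ResourceMemory,
					Target: autoscalingv2beta2.MetricTarget{
						Type:         autoscalingv2beta2.AverageValueMetricType,
						AverageValue: &memTarget,
					},
				},
			}},
		},
	}
}

func main() {
	hpa := fanoutHPA()
	fmt.Println(hpa.Spec.Metrics[0].Resource.Target.AverageValue.String()) // 400Mi
}
```

An alternative under the same API would be a Utilization target, which the HPA interprets as a percentage of the requested memory (e.g. 80% of "500Mi"); either way the threshold stays within what the pod is guaranteed to be able to allocate.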

Additional context
It seems that occasional OOM issues were previously observed on the Broker.

@grantr grantr added this to Backlog in GCP Broker via automation Jul 28, 2020
@grantr grantr added the priority/1 and release/2 labels Jul 28, 2020
@grantr grantr added this to the Backlog milestone Jul 28, 2020
@liu-cong
Contributor

liu-cong commented Aug 4, 2020

Need to figure out #1545 first

@Ectelion
Contributor Author

Ectelion commented Oct 6, 2020

This is now fixed for Ingress and Fanout as part of #1600 and #1601. Retry configs are upcoming (#1602) and will most likely account for this as well. We can close this after #1602.

@Ectelion
Contributor Author

This is now addressed for all the BrokerCell deployments.

GCP Broker automation moved this from Backlog to Done Oct 19, 2020