
Adjust broker ingress resource requirements to meet default events per second target #1600

Closed
grantr opened this issue Aug 18, 2020 · 8 comments · Fixed by #1656

@grantr
Contributor

grantr commented Aug 18, 2020

Problem
Once we have an events per second target (#1596) and a default event payload size (#1599), we can measure the resources a single ingress pod needs to meet that target. Those measurements become the default resource requirement values.

Persona:
User

Exit Criteria
The broker ingress deployment is created with default resource requirements allowing it to meet the default events per second target.

Additional context (optional)
Part of #1552.

@grantr grantr added release/2 priority/1 Blocks current release defined by release/* label or blocks current milestone kind/feature-request New feature or request area/broker labels Aug 18, 2020
@grantr grantr added this to Backlog in GCP Broker via automation Aug 18, 2020
@grantr grantr added this to the v0.18.0-M1 milestone Aug 19, 2020
@yolocs
Member

yolocs commented Aug 24, 2020

To support the target load, here are the proposed values:

  • CPU 2000m
  • Mem 2Gi
  • Pubsub publish buffer 200Mi
  • HPA CPU 90%, Mem 1800Mi
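
For illustration, a minimal sketch of the requests/limits using the Kubernetes Go types (this is not the actual ingress manifest; requests and limits are assumed equal, as discussed further down):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Proposed requests/limits for the ingress container (kept equal,
	// matching the assumption that request and limit should be the same).
	ingressResources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
	}
	fmt.Println(ingressResources.Requests.Cpu(), ingressResources.Requests.Memory())
}
```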

Here is the reasoning.

CPU

Between 1000 and 2000 qps, CPU usage ranges from 1800m to 3000m. Event size has very little impact on CPU usage. Setting the CPU to 2000m allows HPA to kick in near 1000 qps.

Mem

With 1000 qps and 256k events, stable memory usage ranges between roughly 300Mi and 1Gi, staying below 500Mi most of the time. However, when there is a sudden increase in qps or event size, there is always a surge in memory usage that gradually smooths out; the surge can sometimes reach 2Gi. Setting the memory limit to 2Gi helps prevent OOM (HPA on memory has proven ineffective against memory surges), and it also leaves room for larger events.

[screenshot: memory usage chart]

Publish buffer

With a 100Mi buffer size (the default value), we started to see significantly more 500s at 800 qps with 256k events.

[screenshot: error rate chart showing increased 500s]

Setting it to 200Mi mitigates this problem (under the same load). I also want to avoid setting a higher buffer size, as it might allow much more memory to accumulate and cause OOM. See comment.
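
For reference, a sketch of how the buffer size maps onto the Go Pub/Sub client's PublishSettings (the project and topic names are placeholders; the ingress may wire this up differently):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	topic := client.Topic("my-topic")
	// Allow up to ~200Mi of outstanding message bytes before the publisher
	// starts returning buffer-overflow errors.
	topic.PublishSettings.BufferedByteLimit = 200 * 1024 * 1024
	defer topic.Stop()
}
```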

HPA

I want CPU to be the main factor for autoscaling. I'm setting the HPA memory threshold to the upper bound I've seen (under the target load), so hopefully in practice memory will never be the trigger to scale up.
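
A sketch of the corresponding HPA metrics using the autoscaling/v2beta2 Go types (illustrative only; the surrounding HorizontalPodAutoscaler object and the deployed manifest are omitted):

```go
package main

import (
	"fmt"

	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	cpuUtilization := int32(90)               // scale primarily on CPU
	memTarget := resource.MustParse("1800Mi") // memory acts only as a backstop

	metrics := []autoscalingv2beta2.MetricSpec{
		{
			Type: autoscalingv2beta2.ResourceMetricSourceType,
			Resource: &autoscalingv2beta2.ResourceMetricSource{
				Name: corev1.ResourceCPU,
				Target: autoscalingv2beta2.MetricTarget{
					Type:               autoscalingv2beta2.UtilizationMetricType,
					AverageUtilization: &cpuUtilization,
				},
			},
		},
		{
			Type: autoscalingv2beta2.ResourceMetricSourceType,
			Resource: &autoscalingv2beta2.ResourceMetricSource{
				Name: corev1.ResourceMemory,
				Target: autoscalingv2beta2.MetricTarget{
					Type:         autoscalingv2beta2.AverageValueMetricType,
					AverageValue: &memTarget,
				},
			},
		},
	}
	fmt.Println(len(metrics), "HPA metrics configured")
}
```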

@liu-cong
Contributor

These numbers make sense to me, thanks for the investigation!

Just a couple of followups:

  1. Does the CPU grow relatively linearly with QPS? Just to confirm how well CPU works as a proxy for QPS.
  2. If customers complain about memory, I wonder if there is an acceptable tradeoff we can make. For example: memory request=1Gi, HPA threshold=800Mi, memory limit=2Gi. This configuration could go OOM if there is a surge of requests and no memory is available to overcommit, but I imagine there could be a sweet spot where the chance of OOM is very low.

@yolocs
Member

yolocs commented Aug 24, 2020

@liu-cong

  1. Does the CPU grow relatively linearly with QPS? Just to confirm how well CPU works as a proxy for QPS.

Not sure if it's close to "linear", but under 2000 qps I definitely see that more qps causes more CPU usage, and vice versa.

  2. If customers complain about memory, I wonder if there is an acceptable tradeoff we can make.

I was not considering a different memory request/limit; I assumed they have to be the same (as GKE recommends). The numbers you mentioned would probably also work.

@grantr
Contributor Author

grantr commented Aug 24, 2020

Great research, thanks @yolocs!

On your memory chart I see a surge in usage to 9GiB. Will this be an issue if we set the memory limit to 2GiB?

Seems like we should investigate the reason for that memory usage surge to determine if it can be eliminated or mitigated.

@yolocs
Member

yolocs commented Aug 24, 2020

@grantr Sorry for the confusion. That surge was caused by a different setup; I was too lazy to cut that part out of my screenshot.

@grantr
Contributor Author

grantr commented Aug 24, 2020

Ah got it. I still think we should look into why the memory surge is happening :)

@yolocs
Member

yolocs commented Aug 24, 2020

@grantr Fair point :) In practice, there are some challenges I have seen so far.

  1. I used "always" in the comment above. That was not precise. It's very common to see memory surges when qps/size increases, but there were also times when no surge happened (with the exact same setup).

  2. We lack the ability to profile during a specific time window. Although I have profiling data, I cannot tell which profile corresponds to the window when memory was surging.

My current idea is to run two separate sets of tests: 1) test the memory pattern of a plain HTTP server, and 2) test the memory pattern of a Pub/Sub publish client. If one of them shows a pattern similar to what I have seen, that could be the cause.
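
For the profiling gap, one option is to expose pprof in the test binary so a heap profile can be grabbed during the exact window when memory surges (a sketch with net/http/pprof; the port and setup are assumptions):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// While the load test runs, capture a snapshot with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```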

@grantr grantr moved this from Backlog to Prioritized Backlog in GCP Broker Aug 25, 2020
@yolocs
Member

yolocs commented Aug 26, 2020

In all the follow-up tests, I was not able to reliably reproduce the memory surges. In fact, under gradual qps/size increases, I observed no obvious change in memory usage at all, although at a certain point the Pubsub publish buffer limit error started to surface more often.

One amendment to my proposed values: make the Pubsub publish buffer size limit 300Mi. In the follow-up tests, this value more reliably prevented the buffer-limit error when the load was slightly beyond the target.
