
Adjust broker ingress resource requirements to meet default events per second target #1600

Closed
grantr opened this issue Aug 18, 2020 · 8 comments · Fixed by #1656

@grantr
Contributor

grantr commented Aug 18, 2020

Problem
Once we have an events per second target (#1596) and a default event payload size (#1599), we can measure the resources a single ingress pod needs to meet that target. Those measurements become the default resource requirement values.

Persona:
User

Exit Criteria
The broker ingress deployment is created with default resource requirements allowing it to meet the default events per second target.

Additional context (optional)
Part of #1552.

@grantr grantr added release/2 priority/1 Blocks current release defined by release/* label or blocks current milestone kind/feature-request New feature or request area/broker labels Aug 18, 2020
@grantr grantr added this to Backlog in GCP Broker via automation Aug 18, 2020
@grantr grantr added this to the v0.18.0-M1 milestone Aug 19, 2020
@yolocs
Member

yolocs commented Aug 24, 2020

To support the target load, here are the proposed values:

  • CPU 2000m
  • Mem 2Gi
  • Pubsub publish buffer 200Mi
  • HPA CPU 90%, Mem 1800Mi
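
For illustration, a minimal sketch of the requests/limits using the Kubernetes Go types (this is not the actual ingress manifest; requests and limits are assumed equal, as discussed further down):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Proposed requests/limits for the ingress container (kept equal,
	// matching the assumption that request and limit should be the same).
	ingressResources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
	}
	fmt.Println(ingressResources.Requests.Cpu(), ingressResources.Requests.Memory())
}
```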

Here is the reasoning.

CPU

Between 1000 and 2000 qps, CPU usage ranges from 1800m to 3000m. Event size has very little impact on CPU usage. Setting the CPU to 2000m allows HPA to kick in near 1000 qps.

Mem

With 1000 qps and 256k events, stable memory usage ranges between roughly 300Mi and 1Gi, staying below 500Mi most of the time. However, when there is a sudden increase in qps or event size, there is always a surge in memory usage that gradually smooths out; the surge can sometimes reach 2Gi. Setting the memory limit to 2Gi helps prevent OOM (HPA on memory has proven ineffective against memory surges), and it also leaves room for larger events.

[screenshot: memory usage chart]

Publish buffer

With a 100Mi buffer size (the default value), we started to see significantly more 500s at 800 qps with 256k events.

[screenshot: error rate chart showing increased 500s]

Setting it to 200Mi mitigates this problem (under the same load). I also want to avoid setting a higher buffer size, as it might allow much more memory to accumulate and cause OOM. See comment.
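
For reference, a sketch of how the buffer size maps onto the Go Pub/Sub client's PublishSettings (the project and topic names are placeholders; the ingress may wire this up differently):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	topic := client.Topic("my-topic")
	// Allow up to ~200Mi of outstanding message bytes before the publisher
	// starts returning buffer-overflow errors.
	topic.PublishSettings.BufferedByteLimit = 200 * 1024 * 1024
	defer topic.Stop()
}
```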

HPA

I want CPU to be the main factor for autoscaling. I'm setting the HPA memory threshold to the upper bound I've seen (under the target load), so hopefully in practice memory will never be the trigger to scale up.
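
A sketch of the corresponding HPA metrics using the autoscaling/v2beta2 Go types (illustrative only; the surrounding HorizontalPodAutoscaler object and the deployed manifest are omitted):

```go
package main

import (
	"fmt"

	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	cpuUtilization := int32(90)               // scale primarily on CPU
	memTarget := resource.MustParse("1800Mi") // memory acts only as a backstop

	metrics := []autoscalingv2beta2.MetricSpec{
		{
			Type: autoscalingv2beta2.ResourceMetricSourceType,
			Resource: &autoscalingv2beta2.ResourceMetricSource{
				Name: corev1.ResourceCPU,
				Target: autoscalingv2beta2.MetricTarget{
					Type:               autoscalingv2beta2.UtilizationMetricType,
					AverageUtilization: &cpuUtilization,
				},
			},
		},
		{
			Type: autoscalingv2beta2.ResourceMetricSourceType,
			Resource: &autoscalingv2beta2.ResourceMetricSource{
				Name: corev1.ResourceMemory,
				Target: autoscalingv2beta2.MetricTarget{
					Type:         autoscalingv2beta2.AverageValueMetricType,
					AverageValue: &memTarget,
				},
			},
		},
	}
	fmt.Println(len(metrics), "HPA metrics configured")
}
```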

@liu-cong
Contributor

These numbers make sense to me, thanks for the investigation!

Just a couple of followups:

  1. Does the CPU grow relatively linearly with QPS? Just to confirm how well CPU works as a proxy for QPS.
  2. If customers complain about memory, I wonder if there is an acceptable tradeoff we can make. For example: memory request=1Gi, HPA threshold=800Mi, memory limit=2Gi. This configuration could go OOM if there is a surge of requests and no memory is available to overcommit, but I imagine there could be a sweet spot where the chance of OOM is very low.

@yolocs
Member

yolocs commented Aug 24, 2020

@liu-cong

  1. Does the CPU grow relatively linearly with QPS? Just to confirm how well CPU works as a proxy for QPS.

Not sure if it's close to "linear", but under 2000 qps I definitely see that more qps causes more CPU usage, and vice versa.

  2. If customers complain about memory, I wonder if there is an acceptable tradeoff we can make.

I was not considering a different memory request/limit; I assumed they have to be the same (as GKE recommends). The numbers you mentioned would probably also work.

@grantr
Contributor Author

grantr commented Aug 24, 2020

Great research, thanks @yolocs!

On your memory chart I see a surge in usage to 9GiB. Will this be an issue if we set the memory limit to 2GiB?

Seems like we should investigate the reason for that memory usage surge to determine if it can be eliminated or mitigated.

@yolocs
Member

yolocs commented Aug 24, 2020

@grantr Sorry for the confusion. That surge was caused by a different setup; I was too lazy to cut that part out of my screenshot.

@grantr
Contributor Author

grantr commented Aug 24, 2020

Ah got it. I still think we should look into why the memory surge is happening :)

@yolocs
Member

yolocs commented Aug 24, 2020

@grantr Fair point :) In practice, there are some challenges I have seen so far.

  1. I used "always" in the comment above. That was not precise. It's very common to see memory surges when qps/size increases, but there were also times when no surge happened (with the exact same setup).

  2. We lack the ability to profile during a specific time window. Although I have profiling data, I cannot tell which profile corresponds to the window when memory was surging.

My current idea is to run two separate sets of tests: 1) test the memory pattern of a plain HTTP server, and 2) test the memory pattern of a Pub/Sub publish client. If one of them shows a pattern similar to what I have seen, that could be the cause.
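
For the profiling gap, one option is to expose pprof in the test binary so a heap profile can be grabbed during the exact window when memory surges (a sketch with net/http/pprof; the port and setup are assumptions):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// While the load test runs, capture a snapshot with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```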

@grantr grantr moved this from Backlog to Prioritized Backlog in GCP Broker Aug 25, 2020
@yolocs
Member

yolocs commented Aug 26, 2020

In all the follow-up tests, I was not able to reliably reproduce the memory surges. In fact, under gradual qps/size increases, I observed no obvious change in memory usage at all, although at a certain point the Pubsub publish buffer limit error started to surface more often.

One amendment to my proposed values: make the Pubsub publish buffer size limit 300Mi. In the follow-up tests, this value more reliably prevented the buffer-limit error when the load was slightly beyond the target.
