make memory bloat debuggable #1703

Closed
grosser opened this issue Nov 29, 2021 · 20 comments
Labels: enhancement, gator cmd, stale

@grosser (Contributor) commented Nov 29, 2021

We run with a 10 GB limit and the webhook still gets OOMKilled a lot:
[screenshot: memory usage, 2021-11-29]

It would be nice to have "synced objects", "requests in queue", or other things that could bloat memory be queryable, so we can alert and debug when they grow too large.

version: afc9fe2

grosser added the enhancement label on Nov 29, 2021
@maxsmythe (Contributor)

It looks like syncing has a Prometheus metric called gatekeeper-sync?:

syncM         = stats.Int64(syncMetricName, "Total number of resources of each kind being cached", stats.UnitDimensionless)
syncDurationM = stats.Float64(syncDurationMetricName, "Latency of sync operation in seconds", stats.UnitSeconds)
lastRunSyncM  = stats.Float64(lastRunTimeMetricName, "Timestamp of last sync operation", stats.UnitSeconds)
kindKey       = tag.MustNewKey("kind")
statusKey     = tag.MustNewKey("status")
views         = []*view.View{
    {
        Name:        syncM.Name(),
        Measure:     syncM,
        Description: syncM.Description(),
        Aggregation: view.LastValue(),
        TagKeys:     []tag.Key{kindKey, statusKey},
    },
    // ...
}
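
If you wanted to alert on that (per the original ask), a rough sketch of a Prometheus alerting rule might look like the following. This assumes the Prometheus Operator is installed and that the metric is exported as gatekeeper_sync (check your /metrics endpoint for the exact name); the 50000-object threshold is purely illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gatekeeper-sync-cache-size   # hypothetical rule name
  namespace: gatekeeper-system
spec:
  groups:
    - name: gatekeeper-memory
      rules:
        - alert: GatekeeperSyncCacheLarge
          # total number of objects Gatekeeper currently keeps cached, summed over kinds
          expr: sum(gatekeeper_sync) > 50000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Gatekeeper is caching an unusually large number of objects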

Though that wouldn't help if it was a large object (say, a config map) that was causing the memory usage. Is it only the webhook pod and not the audit pod? If so, I wonder if capping the number of webhook goroutines would help.

@willbeason may also be aware of efficiency improvements coming down the pipe.

@willbeason (Member)

Yep, we've got efficiency improvements in the works that should reduce memory usage by 50-60% (or more, depending on use case). We're doing a lot of performance measuring, so we'll have to think about which metrics are helpful (that is, the ones we know correlate with memory/CPU usage and not just noise). We're also planning on adding ways of measuring memory/CPU usage for ConstraintTemplates in the Gator CLI.

For now (before the efficiency improvements):

  • memory usage for reviewing objects is ~ O(reviews per second * number of constraints * average ConstraintTemplate complexity)
  • memory usage for adding N ConstraintTemplates is ~ O(total ConstraintTemplates ^2)

So if you add a lot of ConstraintTemplates at once (~100), you will experience high memory usage. Each object evaluated requires allocating and deallocating at least 70 MB of memory per 1,000 Constraints for simple ConstraintTemplates; for example, reviewing a single object against 2,000 simple Constraints churns through roughly 140 MB of transient allocations, and concurrent reviews multiply that. Complex ConstraintTemplates use significantly more.

"Constraint Template complexity" is a very rough term, can only be experimentally measured, and has variable impact depending on use. A rough approximation is "longer ConstraintTemplates use more memory to execute queries".

These are intended as rough ways of reasoning about performance, and are not completely generalizable. As with all performance advice, characteristics are dependent on use case.

@grosser (Contributor, Author) commented Nov 30, 2021

Memory bloat seems to only affect the gatekeeper webhook; audit is happy at ~2 GB.
I'll look into capping the number of webhook goroutines.

@willbeason (Member)

Memory bloat seems to only affect the gatekeeper webhook; audit is happy at ~2 GB. I'll look into capping the number of webhook goroutines.

I hadn't thought of that; it will indeed limit peak memory, since at most that number of requests will be served at a time.

@grosser (Contributor, Author) commented Jan 10, 2022

same issue with

- name: GOMAXPROCS
  value: '1'

... anyone got more ideas, or is this a missing feature?

@maxsmythe (Contributor)

Try using the --max-serving-threads flag:

var maxServingThreads = flag.Int("max-serving-threads", -1, "(alpha) cap the number of threads handling non-trivial requests, -1 means an infinite number of threads")
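
For reference, a minimal sketch of wiring that flag into the webhook Deployment; the container name below comes from a default install and may differ in yours, and the value 4 is only illustrative (e.g. to match a 4-CPU pod):

# excerpt from the gatekeeper-controller-manager Deployment
containers:
  - name: manager
    args:
      # ...existing args omitted...
      - --max-serving-threads=4   # cap concurrent non-trivial admission reviews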

@grosser (Contributor, Author) commented Jan 12, 2022 via email

willbeason modified the milestone (gator beta) on May 31, 2022
@willbeason (Member)

@grosser Memory consumption for audit should be greatly reduced with Gatekeeper v3.8.x; our benchmarks saw memory usage reduced by 10x or more. Has this improved your situation at all?

The main memory improvements we see aren't debuggable with the Gator CLI, so I'm removing it from the Gator milestone.

@grosser (Contributor, Author) commented Jun 3, 2022

We were not able to update to 3.8.x yet since it causes lots of OOMKills; not a priority at the moment, will report back when we finally do 🤞

@notchairmk

To clarify for v3.8.x a little: audit seems fine, but we ran into memory issues with the webhook in high-volume clusters, so we had to revert the upgrade. We'll likely have a better idea once #2060 is addressed.

@maxsmythe (Contributor)

but we ran into memory issues with the webhook in high-volume clusters

For webhook memory usage, can you set GOMAXPROCS to the # of CPUs in your pod, per #1907? Also, for high-volume usage, setting --max-serving-threads will limit the number of parallel OPA queries, which may also improve memory usage. I'd start by setting --max-serving-threads equal to the # of CPUs in your pod and experiment with tuning later if it appears to help.
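
One way to keep GOMAXPROCS in step with the pod's CPU limit is the Kubernetes Downward API; a minimal sketch (the CPU limit of 4 is illustrative, and resourceFieldRef rounds the limit up to a whole number of cores):

containers:
  - name: manager
    resources:
      limits:
        cpu: "4"
    env:
      - name: GOMAXPROCS
        valueFrom:
          resourceFieldRef:
            resource: limits.cpu
            divisor: "1"   # exposes the CPU limit as an integer core count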

@maxsmythe (Contributor) commented Jun 3, 2022

The above won't fix lock contention, but it should prevent OOMing.

@notchairmk

We currently have GOMAXPROCS set, so we should be good there 👍

We did try setting --max-serving-threads with v3.7.1, which seemed to resolve the infrequent OOMKills (overall slightly decreased memory usage), but it caused validation request duration to increase to the point of falling back to the webhook failureMode. Would be good to try when we approach upgrading again, for sure.
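
(For context on that fallback: it is governed by the failurePolicy and timeoutSeconds fields of Gatekeeper's ValidatingWebhookConfiguration, which in a default install look roughly like the excerpt below, so a slow webhook fails open rather than blocking admissions. The values shown are assumptions; check your own install.)

webhooks:
  - name: validation.gatekeeper.sh
    failurePolicy: Ignore   # admit the request if the webhook errors or times out
    timeoutSeconds: 3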

@tehlers320

I think ours is just processing stuff that doesn't even have a constraint related to it. I tinkered with the admission webhook, changing it from * to a more explicit list, and memory has dropped by about 500 MB per process. But why is this even processing things that have no rules? I wrote a quick-and-dirty admission webhook and it's sitting at ~100m CPU and ~125 MB memory while receiving everything (but doing nothing). So GK must be receiving everything but also running it through all constraints.

Here are some examples of things I don't see anybody ever having a rule for, which are at the top:

 kubectl -n gatekeeper-system logs  gatekeeper-controller-manager-8c874cbc4-bnl9j |grep received |tr ' ' '\n' |grep Kind |sort |uniq -c |sort -nr
 598 Kind=Lease",
 322 Kind=Event",
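
For anyone who wants to try the same narrowing, the change lives in the rules list of Gatekeeper's ValidatingWebhookConfiguration; a rough sketch of replacing the wildcard with an explicit list (the groups and resources below are only examples, keep whatever your constraints actually match):

webhooks:
  - name: validation.gatekeeper.sh
    rules:
      - apiGroups: ["", "apps", "batch"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods", "deployments", "jobs"]   # instead of ["*"]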

@grosser (Contributor, Author) commented Jul 15, 2022 via email

@ritazh (Member) commented Jul 15, 2022

Thanks for sharing this data @tehlers320!

I tinkered with the admission webhook, changing it from * to a more explicit list, and memory has dropped by about 500 MB per process.

Was this with constraints? If so, how many constraints and constraint templates?

FYI, for our setup I added a validation that always makes sure the admission webhook is in sync with the resources the constraints need, so there is no overhead or unenforced constraints.

@grosser Where did you add this validation?

@grosser (Contributor, Author) commented Jul 15, 2022 via email

@tehlers320

@ritazh

kubectl -n gatekeeper-system get constrainttemplates.templates.gatekeeper.sh | wc -l
      21
kubectl -n gatekeeper-system get constrainttemplates.templates.gatekeeper.sh -o yaml | wc -l
    1666
ls gatekeeper-constraints-chart/templates/*/*yaml | wc -l
      22
cat gatekeeper-constraints-chart/templates/*/*yaml | wc -l
     529

I'm not sure if we are super complex.

@ritazh (Member) commented Jul 20, 2022

you can see how many constraints (policies) you have with something like this:
kubectl get constraints | grep gatekeeper.sh | wc -l

Another thing to evaluate: do all constraints have match kinds? If so, only requests that match those kinds should be evaluated. If no match kind is provided, then all requests could be hitting the constraint and ConstraintTemplate for evaluation.

E.g. this constraint ensures that only requests containing a Pod will be evaluated against this constraint and its ConstraintTemplate:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: prod-repo-is-openpolicyagent
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

stale bot commented Sep 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Sep 18, 2022
stale bot closed this as completed on Oct 2, 2022