make memory bloat debuggable #1703

Closed
grosser opened this issue Nov 29, 2021 · 20 comments
Labels: enhancement, gator cmd, stale

@grosser (Contributor) commented Nov 29, 2021

We run with a 10 GB limit and the webhook still gets OOMKilled a lot:
[screenshot: memory usage, 2021-11-29]

It would be nice to have "synced objects", "requests in queue", or other things that could bloat memory be queryable, so we can alert and debug when they grow too large.

version: afc9fe2

grosser added the enhancement label on Nov 29, 2021
@maxsmythe (Contributor)

It looks like syncing has a Prometheus metric called gatekeeper-sync?:

syncM         = stats.Int64(syncMetricName, "Total number of resources of each kind being cached", stats.UnitDimensionless)
syncDurationM = stats.Float64(syncDurationMetricName, "Latency of sync operation in seconds", stats.UnitSeconds)
lastRunSyncM  = stats.Float64(lastRunTimeMetricName, "Timestamp of last sync operation", stats.UnitSeconds)
kindKey       = tag.MustNewKey("kind")
statusKey     = tag.MustNewKey("status")
views         = []*view.View{
    {
        Name:        syncM.Name(),
        Measure:     syncM,
        Description: syncM.Description(),
        Aggregation: view.LastValue(),
        TagKeys:     []tag.Key{kindKey, statusKey},
    },
    // ...
}
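
If you wanted to alert on that (per the original ask), a rough sketch of a Prometheus alerting rule might look like the following. This assumes the Prometheus Operator is installed and that the metric is exported as gatekeeper_sync (check your /metrics endpoint for the exact name); the 50000-object threshold is purely illustrative.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gatekeeper-sync-cache-size   # hypothetical rule name
  namespace: gatekeeper-system
spec:
  groups:
    - name: gatekeeper-memory
      rules:
        - alert: GatekeeperSyncCacheLarge
          # total number of objects Gatekeeper currently keeps cached, summed over kinds
          expr: sum(gatekeeper_sync) > 50000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Gatekeeper is caching an unusually large number of objects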

Though that wouldn't help if it was a large object (say, a config map) that was causing the memory usage. Is it only the webhook pod and not the audit pod? If so, I wonder if capping the number of webhook goroutines would help.

@willbeason may also be aware of efficiency improvements coming down the pipe.

@willbeason (Member)

Yep, we've got efficiency improvements in the works that should reduce memory usage by 50-60% (or more, depending on use case). We're doing a lot of performance measuring, so we'll have to think about which metrics are helpful (that is, the ones we know correlate with memory/CPU usage and not just noise). We're also planning on adding ways of measuring memory/CPU usage for ConstraintTemplates in the Gator CLI.

For now (before the efficiency improvements):

  • memory usage for reviewing objects is ~ O(reviews per second * number of constraints * average ConstraintTemplate complexity)
  • memory usage for adding N ConstraintTemplates is ~ O(total ConstraintTemplates ^2)

So if you add a lot of ConstraintTemplates at once (~100), you will experience high memory usage. Each object evaluated requires allocating and deallocating at least 70 MB of memory per 1,000 Constraints for simple ConstraintTemplates; for example, reviewing a single object against 2,000 simple Constraints churns through roughly 140 MB of transient allocations, and concurrent reviews multiply that. Complex ConstraintTemplates use significantly more.

"Constraint Template complexity" is a very rough term, can only be experimentally measured, and has variable impact depending on use. A rough approximation is "longer ConstraintTemplates use more memory to execute queries".

These are intended as rough ways of reasoning about performance, and are not completely generalizable. As with all performance advice, characteristics are dependent on use case.

@grosser (Contributor, Author) commented Nov 30, 2021

Memory bloat seems to only affect the gatekeeper webhook; audit is happy at ~2 GB.
I'll look into capping the number of webhook goroutines.

@willbeason (Member)

Memory bloat seems to only affect the gatekeeper webhook; audit is happy at ~2 GB. I'll look into capping the number of webhook goroutines.

I hadn't thought of that; it will indeed limit peak memory, since at most that number of requests will be served at a time.

@grosser (Contributor, Author) commented Jan 10, 2022

same issue with

- name: GOMAXPROCS
  value: '1'

... anyone got more ideas, or is this a missing feature?

@maxsmythe (Contributor)

Try using the --max-serving-threads flag:

var maxServingThreads = flag.Int("max-serving-threads", -1, "(alpha) cap the number of threads handling non-trivial requests, -1 means an infinite number of threads")
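
For reference, a minimal sketch of wiring that flag into the webhook Deployment; the container name below comes from a default install and may differ in yours, and the value 4 is only illustrative (e.g. to match a 4-CPU pod):

# excerpt from the gatekeeper-controller-manager Deployment
containers:
  - name: manager
    args:
      # ...existing args omitted...
      - --max-serving-threads=4   # cap concurrent non-trivial admission reviews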

@grosser (Contributor, Author) commented Jan 12, 2022 via email

willbeason modified the milestone (gator beta) on May 31, 2022
@willbeason (Member)

@grosser Memory consumption for audit should be greatly reduced with Gatekeeper v3.8.x; our benchmarks saw memory usage reduced by 10x or more. Has this improved your situation at all?

The main memory improvements we see aren't debuggable with the Gator CLI, so I'm removing it from the Gator milestone.

@grosser (Contributor, Author) commented Jun 3, 2022

We were not able to update to 3.8.x yet since it causes lots of OOMKills; not a priority at the moment, will report back when we finally do 🤞

@notchairmk

To clarify for v3.8.x a little: audit seems fine, but we ran into memory issues with the webhook in high-volume clusters, so we had to revert the upgrade. We'll likely have a better idea once #2060 is addressed.

@maxsmythe (Contributor)

but we ran into memory issues with the webhook in high-volume clusters

For webhook memory usage, can you set GOMAXPROCS to the # of CPUs in your pod, per #1907? Also, for high-volume usage, setting --max-serving-threads will limit the number of parallel OPA queries, which may also improve memory usage. I'd start by setting --max-serving-threads equal to the # of CPUs in your pod and experiment with tuning later if it appears to help.
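
One way to keep GOMAXPROCS in step with the pod's CPU limit is the Kubernetes Downward API; a minimal sketch (the CPU limit of 4 is illustrative, and resourceFieldRef rounds the limit up to a whole number of cores):

containers:
  - name: manager
    resources:
      limits:
        cpu: "4"
    env:
      - name: GOMAXPROCS
        valueFrom:
          resourceFieldRef:
            resource: limits.cpu
            divisor: "1"   # exposes the CPU limit as an integer core count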

@maxsmythe (Contributor) commented Jun 3, 2022

The above won't fix lock contention, but it should prevent OOMing.

@notchairmk

We currently have GOMAXPROCS set, so we should be good there 👍

We did try setting --max-serving-threads with v3.7.1, which seemed to resolve the infrequent OOMKills (overall slightly decreased memory usage), but it caused validation request duration to increase to the point of falling back to the webhook failureMode. Would be good to try when we approach upgrading again, for sure.
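
(For context on that fallback: it is governed by the failurePolicy and timeoutSeconds fields of Gatekeeper's ValidatingWebhookConfiguration, which in a default install look roughly like the excerpt below, so a slow webhook fails open rather than blocking admissions. The values shown are assumptions; check your own install.)

webhooks:
  - name: validation.gatekeeper.sh
    failurePolicy: Ignore   # admit the request if the webhook errors or times out
    timeoutSeconds: 3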

@tehlers320

I think ours is just processing stuff that doesn't even have a constraint related to it. I tinkered with the admission webhook, changing it from * to a more explicit list, and memory has dropped by about 500 MB per process. But why is this even processing things that have no rules? I wrote a quick-and-dirty admission webhook and it's sitting at ~100m CPU and ~125 MB memory while receiving everything (but doing nothing). So GK must be receiving everything but also running it through all constraints.

Here are some examples of things I don't see anybody ever having a rule for, which are at the top:

 kubectl -n gatekeeper-system logs  gatekeeper-controller-manager-8c874cbc4-bnl9j |grep received |tr ' ' '\n' |grep Kind |sort |uniq -c |sort -nr
 598 Kind=Lease",
 322 Kind=Event",
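
For anyone who wants to try the same narrowing, the change lives in the rules list of Gatekeeper's ValidatingWebhookConfiguration; a rough sketch of replacing the wildcard with an explicit list (the groups and resources below are only examples, keep whatever your constraints actually match):

webhooks:
  - name: validation.gatekeeper.sh
    rules:
      - apiGroups: ["", "apps", "batch"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods", "deployments", "jobs"]   # instead of ["*"]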

@grosser (Contributor, Author) commented Jul 15, 2022 via email

@ritazh (Member) commented Jul 15, 2022

Thanks for sharing this data @tehlers320!

I tinkered with the admission webhook, changing it from * to a more explicit list, and memory has dropped by about 500 MB per process.

Was this with constraints? If so, how many constraints and constraint templates?

FYI, for our setup I added a validation that always makes sure the admission webhook is in sync with the resources the constraints need, so there is no overhead or unenforced constraints.

@grosser Where did you add this validation?

@grosser (Contributor, Author) commented Jul 15, 2022 via email

@tehlers320

@ritazh

kubectl -n gatekeeper-system get constrainttemplates.templates.gatekeeper.sh | wc -l
      21
kubectl -n gatekeeper-system get constrainttemplates.templates.gatekeeper.sh -o yaml | wc -l
    1666
ls gatekeeper-constraints-chart/templates/*/*yaml | wc -l
      22
cat gatekeeper-constraints-chart/templates/*/*yaml | wc -l
     529

I'm not sure if we are super complex.

@ritazh (Member) commented Jul 20, 2022

you can see how many constraints (policies) you have with something like this:
kubectl get constraints | grep gatekeeper.sh | wc -l

Another thing to evaluate: do all constraints have match kinds? If so, only requests that match those kinds should be evaluated. If no match kind is provided, then all requests could be hitting the constraint and ConstraintTemplate for evaluation.

E.g. this constraint ensures that only requests containing a Pod will be evaluated against this constraint and its ConstraintTemplate:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: prod-repo-is-openpolicyagent
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

stale bot commented Sep 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Sep 18, 2022
stale bot closed this as completed on Oct 2, 2022