NETOBSERV-1107: optimize ebpf agent map memory and cpu usage #140

msherif1234 · 2023-06-21T14:33:49Z

commit1:

switch to use pointer to metric instead of metric
manual trigger GC after flow eviction complete

commit2:

remove the CO-RE disable workaround added by #133 ("NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations"), along with FlushKernelSpec
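
As a rough illustration of the first item in commit1, the sketch below contrasts keeping aggregated metrics by value with keeping them behind a pointer. `FlowKey`, `FlowMetrics`, `AddByValue` and `AddByPointer` are hypothetical names made up for this example, not the agent's real types or functions, and whether the pointer form actually saves memory depends on struct size and on what the compiler's escape analysis decides.

```go
package main

import "fmt"

// FlowKey and FlowMetrics are hypothetical stand-ins for illustration;
// they are not the agent's real types.
type FlowKey struct {
	SrcPort uint16
	DstPort uint16
}

type FlowMetrics struct {
	Packets uint32
	Bytes   uint64
}

// AddByValue keeps metrics by value: each update copies the struct out of
// the map, modifies the copy, and writes the copy back.
func AddByValue(agg map[FlowKey]FlowMetrics, k FlowKey, m FlowMetrics) {
	cur := agg[k]
	cur.Packets += m.Packets
	cur.Bytes += m.Bytes
	agg[k] = cur
}

// AddByPointer keeps pointers: an existing entry is updated in place, and
// only the first sighting of a key allocates a new FlowMetrics.
func AddByPointer(agg map[FlowKey]*FlowMetrics, k FlowKey, m *FlowMetrics) {
	if cur, ok := agg[k]; ok {
		cur.Packets += m.Packets
		cur.Bytes += m.Bytes
		return
	}
	first := *m // copy so the map does not alias the caller's struct
	agg[k] = &first
}

func main() {
	k := FlowKey{SrcPort: 1234, DstPort: 443}

	byValue := map[FlowKey]FlowMetrics{}
	AddByValue(byValue, k, FlowMetrics{Packets: 1, Bytes: 100})
	AddByValue(byValue, k, FlowMetrics{Packets: 2, Bytes: 200})

	byPointer := map[FlowKey]*FlowMetrics{}
	AddByPointer(byPointer, k, &FlowMetrics{Packets: 1, Bytes: 100})
	AddByPointer(byPointer, k, &FlowMetrics{Packets: 2, Bytes: 200})

	fmt.Println(byValue[k].Packets, byPointer[k].Packets)
}
```

Both variants produce the same totals; the difference is only in how many struct copies each update makes.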

msherif1234 · 2023-06-21T14:34:04Z

/ok-to-test

github-actions · 2023-06-21T14:35:34Z

New image: quay.io/netobserv/netobserv-ebpf-agent:ec12aa3. It will expire after two weeks.

codecov · 2023-06-21T14:40:32Z

Codecov Report

Merging #140 (c0be2e2) into main (96e8f61) will increase coverage by 0.04%.
The diff coverage is 40.00%.

❗ Current head c0be2e2 differs from pull request most recent head 91e41fd. Consider uploading reports for the commit 91e41fd to get more accurate results

@@            Coverage Diff             @@
##             main     #140      +/-   ##
==========================================
+ Coverage   40.60%   40.65%   +0.04%     
==========================================
  Files          31       31              
  Lines        2054     2054              
==========================================
+ Hits          834      835       +1     
+ Misses       1181     1180       -1     
  Partials       39       39

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `40.65% <40.00%> (+0.04%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/agent/agent.go | `39.20% <ø> (ø)` | |
| pkg/ebpf/tracer.go | `0.00% <0.00%> (ø)` | |
| pkg/exporter/ipfix.go | `0.00% <0.00%> (ø)` | |
| pkg/flow/tracer_map.go | `83.33% <100.00%> (+0.25%)` | ⬆️ |
| pkg/test/tracer_fake.go | `67.85% <100.00%> (ø)` | |

msherif1234 · 2023-06-21T17:17:50Z

/ok-to-test

github-actions · 2023-06-21T17:19:13Z

New image: quay.io/netobserv/netobserv-ebpf-agent:741db71. It will expire after two weeks.

msherif1234 · 2023-06-21T19:22:52Z

/ok-to-test

github-actions · 2023-06-21T19:24:20Z

New image: quay.io/netobserv/netobserv-ebpf-agent:a38c27e. It will expire after two weeks.

openshift-ci-robot · 2023-06-21T19:39:10Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msherif1234 · 2023-06-21T20:35:50Z

/ok-to-test

github-actions · 2023-06-21T20:37:26Z

New image: quay.io/netobserv/netobserv-ebpf-agent:0f9bc91. It will expire after two weeks.

msherif1234 · 2023-06-21T21:00:21Z

/ok-to-test

github-actions · 2023-06-21T21:01:59Z

New image: quay.io/netobserv/netobserv-ebpf-agent:39e285a. It will expire after two weeks.

openshift-ci-robot · 2023-06-22T01:59:44Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

use BatchLookupAndDelete instead of an iterate loop with Delete, where it's supported
commit2:

remove CO-RE disable workaround and FlushKernelSpec

openshift-ci-robot · 2023-06-22T02:00:02Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

use BatchLookupAndDelete instead of an iterate loop with Delete, where it's supported

commit2:

remove CO-RE disable workaround and FlushKernelSpec

openshift-ci-robot · 2023-06-22T02:00:40Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

use BatchLookupAndDelete instead of an iterate loop with Delete, where it's supported

commit2:

remove CO-RE disable workaround added by NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations #133 and FlushKernelSpec

openshift-ci-robot · 2023-06-22T02:39:36Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

use BatchLookupAndDelete instead of an iterate loop with Delete, where it's supported

commit2:

remove CO-RE disable workaround added by NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations #133 and FlushKernelSpec

jpinsonneau · 2023-06-22T07:20:57Z

pkg/flow/tracer_map.go

@@ -115,5 +116,6 @@ func (m *MapTracer) evictFlows(ctx context.Context, forwardFlows chan<- []*Recor
 	default:
 		forwardFlows <- forwardingFlows
 	}
+	runtime.GC() // Triggers a manual GC


I'm not sure about this manual call. As I understand it, GOMEMLIMIT should be good enough if it is set properly.

Would it be worth making this configurable? We could expose it in the debug section of the CRD.

msherif1234

The manual call is there because, under a normal run that is not close to the resource limit, GOMEMLIMIT won't trigger a GC, so we keep leaking flows map memory and memory keeps building up. IMO this manual call prevents that buildup.

jotak

I agree with Julien. We know it has a CPU cost, and memory piling up is not bad per se as long as it stays within the configured limits.
GOMEMLIMIT should be sufficient to avoid OOM due to a lack of GC calls, per my understanding.
But +1 to making this option configurable, so that we can still have a more aggressive GC strategy that we can turn on and off at any time.

msherif1234

I added an env var to control it and defaulted it to true, because we saw an immediate memory gain with it at the cost of a CPU increase. I will explore options to trim down those resources further in follow-up PRs.
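
A minimal sketch of what an env-var-gated manual GC could look like, under assumptions: the thread does not name the variable, so `EXAMPLE_FORCE_GC`, `ParseForceGC` and `GCAfterEviction` are hypothetical names chosen for this example. The only behavior taken from the discussion is that the switch defaults to true and that the GC call runs after eviction.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// forceGCEnv is a hypothetical variable name; the thread does not say what
// the real environment variable is called.
const forceGCEnv = "EXAMPLE_FORCE_GC"

// ParseForceGC decides whether to force a GC after eviction. It defaults to
// true when the variable is unset or cannot be parsed as a boolean.
func ParseForceGC(value string, isSet bool) bool {
	if !isSet {
		return true
	}
	enabled, err := strconv.ParseBool(value)
	if err != nil {
		return true
	}
	return enabled
}

// GCAfterEviction stands in for the end of an eviction pass. When force is
// true it runs a blocking garbage collection, then reports how many GC
// cycles the runtime has completed so far.
func GCAfterEviction(force bool) uint32 {
	if force {
		runtime.GC()
	}
	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	return stats.NumGC
}

func main() {
	value, isSet := os.LookupEnv(forceGCEnv)
	force := ParseForceGC(value, isSet)
	fmt.Printf("force GC: %v, completed GC cycles: %d\n", force, GCAfterEviction(force))
}
```

Keeping the parsing separate from the `runtime.GC()` call makes the default easy to test, and lets the more aggressive strategy be switched off without a rebuild.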

jpinsonneau

LGTM in terms of code 👍 thanks @msherif1234 !

openshift-ci-robot · 2023-06-22T20:43:06Z

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

commit2:

remove CO-RE disable workaround added by NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations #133 and FlushKernelSpec

msherif1234 · 2023-06-22T20:54:56Z

CPU is slightly higher because we now trigger a manual GC to help Go free memory more quickly.
Memory is slightly lower, but we also removed the workaround from PR #133 ("NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations"), which was increasing both memory and CPU; that is why it hid a memory reduction of around 25%.

msherif1234 · 2023-06-22T23:21:43Z

/ok-to-test

github-actions · 2023-06-22T23:23:01Z

New image: quay.io/netobserv/netobserv-ebpf-agent:faf35e9. It will expire after two weeks.

openshift-ci-robot · 2023-06-23T11:14:47Z

@msherif1234: This pull request references NETOBSERV-1103 which is a valid jira issue.

In response to this:

commit1:

switch to use pointer to metric instead of metric

manual trigger GC after flow eviction complete

commit2:

remove CO-RE disable workaround added by NETOBSERV-1091: remove CO-RE file and extensions as that causes douple allocations #133 and FlushKernelSpec

jotak · 2023-06-23T12:05:10Z

pkg/agent/config.go

@@ -136,4 +136,6 @@ type Config struct {
 	KafkaSASLClientSecretPath string `env:"KAFKA_SASL_CLIENT_SECRET_PATH"`
 	// ProfilePort sets the listening port for Go's Pprof tool. If it is not set, profile is disabled
 	ProfilePort int `env:"PROFILE_PORT"`
+	// GoMemLimit sets soft memory cap to ebpf agent process
+	GoMemLimit string `env:"GOMEM_LIMIT"`


Why not use the GOMEMLIMIT env directly? That would avoid having to sync the env and the config. It would be set directly by the operator / CRD, exactly as we can already do with other Go env vars such as GOGC.

msherif1234

I thought we wanted to control this value from the operator? The operator and the agent are two different processes, so I'm not sure I understand your comment. Please clarify.

msherif1234

I get it: the operator will set the env var, so nothing is needed here. I will remove this piece.
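
To make the conclusion concrete: the Go runtime (1.19 and later) reads `GOMEMLIMIT` from the process environment on its own at startup, so the agent needs no config field for it. The sketch below only inspects the limit in effect; `CurrentMemoryLimit` is a hypothetical helper name for this example, while `debug.SetMemoryLimit` is the standard library call, which returns the previous limit and leaves it unchanged when given a negative value.

```go
package main

import (
	"fmt"
	"math"
	"runtime/debug"
)

// CurrentMemoryLimit returns the soft memory limit currently in effect.
// A negative argument tells debug.SetMemoryLimit not to change the limit.
func CurrentMemoryLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	limit := CurrentMemoryLimit()
	if limit == math.MaxInt64 {
		// math.MaxInt64 is the runtime's default, meaning no limit is set.
		fmt.Println("no soft memory limit set")
		return
	}
	fmt.Printf("soft memory limit: %d bytes\n", limit)
}
```

Running it as, for example, `GOMEMLIMIT=512MiB go run .` should report the limit without the program ever parsing the variable itself.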

jotak · 2023-06-23T16:07:07Z

/lgtm
thanks!

msherif1234 · 2023-06-23T16:24:53Z

/approve

openshift-ci · 2023-06-23T16:25:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [msherif1234]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Jun 21, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

msherif1234 mentioned this pull request Jun 21, 2023

WIP Fix memory and cpu scale issue work around in #133 #135

Closed

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

msherif1234 changed the title ~~WIP: optimize ebpf agent map memory usage~~ WIP: NETOBSERV-1107: optimize ebpf agent map memory usage Jun 21, 2023

openshift-ci-robot added the jira/valid-reference label Jun 21, 2023

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023

msherif1234 changed the title ~~WIP: NETOBSERV-1107: optimize ebpf agent map memory usage~~ WIP: NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jun 22, 2023

jpinsonneau reviewed Jun 22, 2023

jpinsonneau previously approved these changes Jun 22, 2023

openshift-ci bot assigned jpinsonneau Jun 22, 2023

openshift-ci bot added the lgtm label Jun 22, 2023

msherif1234 dismissed jpinsonneau’s stale review via d5ff511 June 22, 2023 16:24

openshift-ci bot removed the lgtm label Jun 22, 2023

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 22, 2023

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 22, 2023

msherif1234 changed the title ~~WIP: NETOBSERV-1107: optimize ebpf agent map memory and cpu usage~~ NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jun 23, 2023

openshift-ci bot removed the do-not-merge/work-in-progress label Jun 23, 2023

msherif1234 changed the title ~~NETOBSERV-1107: optimize ebpf agent map memory and cpu usage~~ NETOBSERV-1103: optimize ebpf agent map memory and cpu usage Jun 23, 2023

jotak reviewed Jun 23, 2023

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 23, 2023

msherif1234 added 2 commits June 23, 2023 09:30

Optimize ebpf agent map memory usage

e694b1c

- switch to use pointer to metric instead of metric
- manual trigger GC after flow eviction complete

Signed-off-by: msherif1234 <mmahmoud@redhat.com>

Fix memory and cpu scale issue work around in #133

91e41fd

following up on cilium/ebpf#1063 it seems we have a way to fix resources issues

Signed-off-by: msherif1234 <mmahmoud@redhat.com>
(cherry picked from commit b9c9a03)

openshift-ci bot assigned jotak Jun 23, 2023

openshift-ci bot added the lgtm label Jun 23, 2023

openshift-ci bot added the approved label Jun 23, 2023

openshift-merge-robot merged commit 2d63d90 into netobserv:main Jun 23, 2023
9 checks passed

msherif1234 changed the title ~~NETOBSERV-1103: optimize ebpf agent map memory and cpu usage~~ NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jul 11, 2023
