NETOBSERV-1107: optimize ebpf agent map memory and cpu usage #140
Conversation
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:ec12aa3. It will expire after two weeks.
Codecov Report
@@           Coverage Diff            @@
##             main     #140    +/-  ##
==========================================
+ Coverage   40.60%   40.65%   +0.04%
==========================================
  Files          31       31
  Lines        2054     2054
==========================================
+ Hits          834      835       +1
+ Misses       1181     1180       -1
  Partials       39       39
Flags with carried forward coverage won't be shown.
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:741db71. It will expire after two weeks.
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:a38c27e. It will expire after two weeks.
@msherif1234: This pull request references NETOBSERV-1107, which is a valid Jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:0f9bc91. It will expire after two weeks.
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:39e285a. It will expire after two weeks.
pkg/flow/tracer_map.go (outdated)
@@ -115,5 +116,6 @@ func (m *MapTracer) evictFlows(ctx context.Context, forwardFlows chan<- []*Record
	default:
		forwardFlows <- forwardingFlows
	}
	runtime.GC() // Triggers a manual GC
I'm not sure about this manual call. To me, GOMEMLIMIT should be good enough if properly set.
Would it be worth making this configurable? We could expose it in the debug section of the CRD.
The manual call is needed because, under a normal run that isn't close to the resource limit, GOMEMLIMIT won't trigger a GC, so the flows map memory keeps leaking and usage keeps building up. This manual call prevents that buildup, IMO.
I agree with Julien: we know it has a CPU cost, and having memory pile up is not bad per se as long as it stays within the configured limit bounds.
GOMEMLIMIT should be sufficient to avoid OOMs due to a lack of GC calls, per my understanding.
But +1 to keeping this option configurable, so that we can still have a more aggressive GC strategy that we can turn on and off at any time.
Added an env var to control it and defaulted it to true, because we saw an immediate memory gain with it at the cost of a CPU bump. I will explore options to better trim down those resources in follow-up PRs.
LGTM in terms of code 👍 thanks @msherif1234 !
/ok-to-test
New image: quay.io/netobserv/netobserv-ebpf-agent:faf35e9. It will expire after two weeks.
@msherif1234: This pull request references NETOBSERV-1103, which is a valid Jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/agent/config.go (outdated)
@@ -136,4 +136,6 @@ type Config struct {
	KafkaSASLClientSecretPath string `env:"KAFKA_SASL_CLIENT_SECRET_PATH"`
	// ProfilePort sets the listening port for Go's Pprof tool. If it is not set, profile is disabled
	ProfilePort int `env:"PROFILE_PORT"`
	// GoMemLimit sets soft memory cap to ebpf agent process
	GoMemLimit string `env:"GOMEM_LIMIT"`
Why not use the GOMEMLIMIT env directly? That would avoid having to keep the env and the config in sync. It would be set directly by the operator / CRD, exactly as we can already do with other Go envs such as GOGC.
I thought we wanted to control this value from the operator? The operator and agent are two different processes, so I'm not sure I understand your comment; please clarify.
I get it: the operator will set the env var, so nothing is needed here. I will remove this piece.
- switch to use a pointer to metric instead of metric
- manually trigger GC after flow eviction completes
Signed-off-by: msherif1234 <mmahmoud@redhat.com>
Following up on cilium/ebpf#1063, it seems we have a way to fix the resource issues.
Signed-off-by: msherif1234 <mmahmoud@redhat.com> (cherry picked from commit b9c9a03)
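The memory work discussed in cilium/ebpf#1063 is exposed by the library as `btf.FlushKernelSpec()`, which drops the cached kernel BTF type information. A sketch of the idea (not self-contained: it assumes the `github.com/cilium/ebpf` module is available, so treat it as illustrative):

```go
import "github.com/cilium/ebpf/btf"

// Once all eBPF collections have been loaded, the cached kernel type
// information is no longer needed and can be released to reclaim memory.
btf.FlushKernelSpec()
```

The cache is rebuilt lazily if BTF is needed again, so calling this after agent startup trades a potential later reload for an immediate memory reduction.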
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msherif1234
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
commit1:
commit2:
FlushKernelSpec