Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1107: optimize ebpf agent map memory and cpu usage #140

Merged
merged 2 commits into from
Jun 23, 2023
Merged

NETOBSERV-1107: optimize ebpf agent map memory and cpu usage #140

merged 2 commits into from
Jun 23, 2023

Conversation

msherif1234
Copy link
Contributor

@msherif1234 msherif1234 commented Jun 21, 2023

commit1:

  • switch to use pointer to metric instead of metric
  • manual trigger GC after flow eviction complete

commit2:

@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:ec12aa3. It will expire after two weeks.

@codecov
Copy link

codecov bot commented Jun 21, 2023

Codecov Report

Merging #140 (c0be2e2) into main (96e8f61) will increase coverage by 0.04%.
The diff coverage is 40.00%.

❗ Current head c0be2e2 differs from pull request most recent head 91e41fd. Consider uploading reports for the commit 91e41fd to get more accurate results

@@            Coverage Diff             @@
##             main     #140      +/-   ##
==========================================
+ Coverage   40.60%   40.65%   +0.04%     
==========================================
  Files          31       31              
  Lines        2054     2054              
==========================================
+ Hits          834      835       +1     
+ Misses       1181     1180       -1     
  Partials       39       39              
Flag Coverage Δ
unittests 40.65% <40.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/agent.go 39.20% <ø> (ø)
pkg/ebpf/tracer.go 0.00% <0.00%> (ø)
pkg/exporter/ipfix.go 0.00% <0.00%> (ø)
pkg/flow/tracer_map.go 83.33% <100.00%> (+0.25%) ⬆️
pkg/test/tracer_fake.go 67.85% <100.00%> (ø)

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:741db71. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:a38c27e. It will expire after two weeks.

@msherif1234 msherif1234 changed the title WIP: optimize ebpf agent map memory usage WIP: NETOBSERV-1107: optimize ebpf agent map memory usage Jun 21, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 21, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

  • switch to use pointer to metric instead of metric
  • manuall trigger GC after flow eviction complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:0f9bc91. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 21, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:39e285a. It will expire after two weeks.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 22, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manuall trigger GC after flow eviction complete
  • use BatchLookupAndDelete instead of iterate loop with Delete its supported
    commit2:
  • remove CO-RE disable workaround and FlushKernelSpec

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 22, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manuall trigger GC after flow eviction complete
  • use BatchLookupAndDelete instead of iterate loop with Delete its supported

commit2:

  • remove CO-RE disable workaround and FlushKernelSpec

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 22, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manuall trigger GC after flow eviction complete
  • use BatchLookupAndDelete instead of iterate loop with Delete its supported

commit2:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234 msherif1234 changed the title WIP: NETOBSERV-1107: optimize ebpf agent map memory usage WIP: NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jun 22, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 22, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manual trigger GC after flow eviction complete
  • use BatchLookupAndDelete instead of iterate loop with Delete its supported

commit2:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -115,5 +116,6 @@ func (m *MapTracer) evictFlows(ctx context.Context, forwardFlows chan<- []*Recor
default:
forwardFlows <- forwardingFlows
}
runtime.GC() // Triggers a manual GC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this manual call. For me the GOMEMLIMIT is supposed to be good enough if properly set.

Would it be interesting to make this configurable ?
We can expose it in debug section of the CRD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manual call because under normal run not close to resource limit GOMEMLIMIT won't trigger GC so we will keep leaking flows map memory causing the memory to keep building up this manual call to prevent this memory buildup IMO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Julien, we know it has a CPU cost, having memory piling up is not bad per se as long as it's in the limit bounds configured.
GOMEMLIMIT should be sufficient to avoid OOM due to lack of gc calls, per my understanding.
But +1 to keep this option configurable so that we can still have a more aggressive GC strategy that we can turn on and off any time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added envvar to control it and default it to true because we saw immediate memory gain with it at the cost of cpu pump I will explore options to better trim down those resources in the following PRs

jpinsonneau
jpinsonneau previously approved these changes Jun 22, 2023
Copy link
Contributor

@jpinsonneau jpinsonneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in terms of code 👍 thanks @msherif1234 !

@openshift-ci openshift-ci bot added the lgtm label Jun 22, 2023
@openshift-ci openshift-ci bot removed the lgtm label Jun 22, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 22, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 22, 2023

@msherif1234: This pull request references NETOBSERV-1107 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manual trigger GC after flow eviction complete

commit2:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234
Copy link
Contributor Author

msherif1234 commented Jun 22, 2023

Screenshot from 2023-06-22 16-52-45

@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 22, 2023
@github-actions
Copy link

New image: quay.io/netobserv/netobserv-ebpf-agent:faf35e9. It will expire after two weeks.

@msherif1234 msherif1234 changed the title WIP: NETOBSERV-1107: optimize ebpf agent map memory and cpu usage NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jun 23, 2023
@msherif1234 msherif1234 changed the title NETOBSERV-1107: optimize ebpf agent map memory and cpu usage NETOBSERV-1103: optimize ebpf agent map memory and cpu usage Jun 23, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 23, 2023

@msherif1234: This pull request references NETOBSERV-1103 which is a valid jira issue.

In response to this:

commit1:

  • switch to use pointer to metric instead of metric
  • manual trigger GC after flow eviction complete

commit2:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -136,4 +136,6 @@ type Config struct {
KafkaSASLClientSecretPath string `env:"KAFKA_SASL_CLIENT_SECRET_PATH"`
// ProfilePort sets the listening port for Go's Pprof tool. If it is not set, profile is disabled
ProfilePort int `env:"PROFILE_PORT"`
// GoMemLimit sets soft memory cap to ebpf agent process
GoMemLimit string `env:"GOMEM_LIMIT"`
Copy link
Member

@jotak jotak Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using directly the GOMEMLIMIT env? That would avoid having to sync the env and the config. You would have it directly set by the operator / CRD, exactly like we already can do with other GO env such as GOGC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wanted to control this value from operator ? operator and agent are two different processes so not sure I understand ur comment pls clarify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it operator will set the envvar nothing needed here I will remove this piece

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 23, 2023
- switch to use pointer to metric instead of metric
- manuall trigger GC after flow eviction complete

Signed-off-by: msherif1234 <mmahmoud@redhat.com>
following up on cilium/ebpf#1063
it seems we have a way to fix resources issues

Signed-off-by: msherif1234 <mmahmoud@redhat.com>
(cherry picked from commit b9c9a03)
@jotak
Copy link
Member

jotak commented Jun 23, 2023

/lgtm
thanks!

@msherif1234
Copy link
Contributor Author

/approve

@openshift-ci
Copy link

openshift-ci bot commented Jun 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2d63d90 into netobserv:main Jun 23, 2023
9 checks passed
@msherif1234 msherif1234 changed the title NETOBSERV-1103: optimize ebpf agent map memory and cpu usage NETOBSERV-1107: optimize ebpf agent map memory and cpu usage Jul 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants