NETOBSERV-1545: Expose a counter for BPF hashmap update packets drop #304

Merged
merged 1 commit into from
Mar 28, 2024

Conversation

msherif1234
Copy link
Contributor

@msherif1234 msherif1234 commented Mar 25, 2024

Description

cilium doesn't seem to have a way to read globals from an eBPF program. I will keep this PR as a draft until we have a way to read a global from userspace.

https://cilium.slack.com/archives/C027KBX679U/p1711370379468299

So a per-CPU array map will be used to hold the global counter; userspace will read it, aggregate the per-CPU values, and update the metrics.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Mar 25, 2024

@msherif1234: This pull request references NETOBSERV-1545 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@msherif1234 msherif1234 changed the title WIP: NETOBSERV-1545: Expose a counter of BPF drops NETOBSERV-1545: Expose a counter of BPF drops Mar 26, 2024
@msherif1234 msherif1234 requested a review from jotak March 26, 2024 11:47
@msherif1234 msherif1234 force-pushed the update_err_counter branch 2 times, most recently from ba261e4 to c0b7a79 Compare March 26, 2024 12:02
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 26, 2024
Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:add77ab

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=add77ab make set-agent-image

@msherif1234
Copy link
Contributor Author

/ok-to-test

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 26, 2024
@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 26, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 0%, with 18 lines in your changes missing coverage. Please review.

Project coverage is 33.88%. Comparing base (a5bcf49) to head (a2ba1b4).

❗ Current head a2ba1b4 differs from pull request most recent head fde93e1. Consider uploading reports for the commit fde93e1 to get more accurate results

Files                        Patch %   Lines
pkg/ebpf/tracer.go           0.00%     16 Missing ⚠️
pkg/ebpf/bpf_x86_bpfel.go    0.00%     1 Missing ⚠️
pkg/ebpf/tracer_legacy.go    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #304      +/-   ##
==========================================
- Coverage   34.04%   33.88%   -0.16%     
==========================================
  Files          47       47              
  Lines        3836     3854      +18     
==========================================
  Hits         1306     1306              
- Misses       2444     2462      +18     
  Partials       86       86              
Flag Coverage Δ
unittests 33.88% <0.00%> (-0.16%) ⬇️


Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:8a1158f

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=8a1158f make set-agent-image

@msherif1234
Copy link
Contributor Author

msherif1234 commented Mar 26, 2024

To emulate the error condition:

  • Set the sampling rate to 1.
  • Set cacheMaxFlows to 100.
  • Run ./hey-ho.sh -r 5 -d 3 -z 10m -n 4 -q 2 -p -b

@msherif1234
Copy link
Contributor Author

/gh pr ready

@msherif1234 msherif1234 marked this pull request as ready for review March 26, 2024 14:24
@msherif1234 msherif1234 changed the title NETOBSERV-1545: Expose a counter of BPF drops NETOBSERV-1545: Expose a counter for BPF hashmap update packets drop Mar 26, 2024
bpf_printk("error updating flow %d\n", ret);
}
// Update global counter for hashmap update errors
error_counter_p = bpf_map_lookup_elem(&global_counters, &key);
Copy link
Member

@jotak jotak Mar 28, 2024

To make sure I understand: you created global_counters as a general-purpose map, i.e. today it holds the drop counter at key 0, but later we may add more counters at other indexes. Is that correct?
If so, perhaps you could define a constant such as DROP_COUNTER_KEY = 0?

Copy link
Contributor Author

Yes, and I am already thinking of using this to count filtered flows :) at a different index, as well as for OVS debugging. I will add a const here and on the Go side. Thanks!
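One way the named keys discussed in this thread might look on the Go side, as a sketch only. `dropCounterKey` follows the name suggested in the review; `filteredFlowsKey` and `maxCounters` are hypothetical placeholders for slots that do not exist in this PR.

```go
package main

import "fmt"

// counterKey indexes a slot in the shared counters map.
type counterKey uint32

const (
	// dropCounterKey is slot 0: packets dropped because the flow hashmap
	// update failed (the counter introduced by this PR).
	dropCounterKey counterKey = iota
	// filteredFlowsKey is a hypothetical future slot, shown only to
	// illustrate how further counters could be added.
	filteredFlowsKey
	// maxCounters is the number of slots; hypothetical, it would have to
	// match the map size declared on the eBPF side.
	maxCounters
)

func main() {
	fmt.Println(dropCounterKey, filteredFlowsKey, maxCounters)
}
```

Whatever names are chosen, the values have to stay in sync with the matching constants in the eBPF C code, since both sides index the same map.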

// ReadGlobalCounter reads the global counter and updates hashmap update error counter metrics
func (m *FlowFetcher) ReadGlobalCounter(met *metrics.Metrics) {
	var allCPUValue []uint32
	key := uint32(0)
Copy link
Member

@jotak jotak Mar 28, 2024

Here as well, this key could be a constant like dropCounterKey = 0.
It would make it more obvious that we can add more keys for more counters.

	}
	// aggregate all the counters
	for _, counter := range allCPUValue {
		met.Errors.WithErrorName("flow-fetcher", "CannotUpdateHashMapCounter").Add(float64(counter))
Copy link
Member

There's already a metric for drops, could we use it instead?

Suggested change
- met.Errors.WithErrorName("flow-fetcher", "CannotUpdateHashMapCounter").Add(float64(counter))
+ met.DroppedFlowsCounter.WithSourceAndReason("flow-fetcher", "CannotUpdateHashMapCounter").Add(float64(counter))

Copy link
Contributor Author

Sounds good, I will use it then.
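To illustrate the idea behind the suggestion, here is a small stand-in for a counter labelled by source and reason. It is a sketch under stated assumptions: `dropLabels` and `dropCounters` are hypothetical types written for this example, and they are neither the agent's metrics package nor the Prometheus client.

```go
package main

import "fmt"

// dropLabels mirrors the "source" and "reason" arguments used in the
// suggested change. Hypothetical type, for illustration only.
type dropLabels struct {
	source string
	reason string
}

// dropCounters is a stand-in for a labelled counter: one running total
// per (source, reason) pair.
type dropCounters map[dropLabels]float64

// add accumulates n drops under the given source and reason.
func (d dropCounters) add(source, reason string, n float64) {
	d[dropLabels{source: source, reason: reason}] += n
}

func main() {
	drops := dropCounters{}
	// Hypothetical per-CPU readings of the hashmap update failure counter.
	for _, counter := range []uint32{2, 0, 5} {
		drops.add("flow-fetcher", "CannotUpdateHashMapCounter", float64(counter))
	}
	key := dropLabels{source: "flow-fetcher", reason: "CannotUpdateHashMapCounter"}
	fmt.Println(drops[key])
}
```

Keeping all drops in a single metric with a reason label means dashboards can sum drops across causes, or break them down by cause, without needing to know about each error name.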

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 28, 2024
@msherif1234 msherif1234 requested a review from jotak March 28, 2024 13:07
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
Copy link
Member

@jotak jotak left a comment

Thanks @msherif1234!

@msherif1234
Copy link
Contributor Author

/approve

Copy link

openshift-ci bot commented Mar 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

@openshift-merge-bot openshift-merge-bot bot merged commit b63f1dd into netobserv:main Mar 28, 2024
9 checks passed
4 participants