
NETOBSERV-557: add eBPF agent metrics for troubleshooting #263

Merged
merged 1 commit into from
Feb 21, 2024

Conversation

msherif1234
Contributor

@msherif1234 msherif1234 commented Feb 7, 2024

Description

Add Prometheus ("promo") metrics to the eBPF agent, with the ability to export them to a Prometheus server.

Unit test

Tested locally using the standalone ebpf-agent:

  • sudo LOG_LEVEL=debug FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 METRICS_PROMO_ENABLE="true" ./bin/netobserv-ebpf-agent
  • from another terminal: curl 127.0.0.1:9090/metrics | grep ebpf_agent
    Results:
# TYPE ebpf_agent_err_can_not_delete_flow_entries counter
ebpf_agent_err_can_not_delete_flow_entries{operational="errors while deleting flows"} 0
# HELP ebpf_agent_err_can_not_write_to_grpc Error can not write to GRPC
# TYPE ebpf_agent_err_can_not_write_to_grpc counter
ebpf_agent_err_can_not_write_to_grpc{operational="err_export_by_grpc"} 0
# HELP ebpf_agent_hashmap_evictions Number of hashmap evictions
# TYPE ebpf_agent_hashmap_evictions counter
ebpf_agent_hashmap_evictions{operational="hash map evictions"} 16
# HELP ebpf_agent_number_of_evicted_flows Number of evicted flows
# TYPE ebpf_agent_number_of_evicted_flows gauge
ebpf_agent_number_of_evicted_flows{operational="number of evicted flows"} 41
# HELP ebpf_agent_number_of_flows_received_via_ring_buffer Number of flows received via ring buffer
# TYPE ebpf_agent_number_of_flows_received_via_ring_buffer gauge
ebpf_agent_number_of_flows_received_via_ring_buffer{operational="number_of_flows_received"} 0
# HELP ebpf_agent_number_of_records_received_by_grpc Number of records received by GRPC
# TYPE ebpf_agent_number_of_records_received_by_grpc counter
ebpf_agent_number_of_records_received_by_grpc{operational="number_of_records_received_by_grpc"} 41
# HELP ebpf_agent_sampling_rate Sampling rate
# TYPE ebpf_agent_sampling_rate gauge
ebpf_agent_sampling_rate{operational="sampling rate"} 50
# HELP ebpf_agent_time_spent_in_lookup_and_delete_map Time spent in lookup and delete map
# TYPE ebpf_agent_time_spent_in_lookup_and_delete_map histogram
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.001"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.01"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.1"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="100"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1000"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10000"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="+Inf"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_sum{operational="time spent in lookup and delete"} 0.005711665
ebpf_agent_time_spent_in_lookup_and_delete_map_count{operational="time spent in lookup and delete"} 16
# HELP ebpf_agent_userspace_evictions Number of userspace evictions
# TYPE ebpf_agent_userspace_evictions counter
ebpf_agent_userspace_evictions{operational="user space evictions"} 0
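The output above is the standard Prometheus text exposition format: a # HELP line, a # TYPE line, then one sample per label set. As a minimal sketch of that structure (a hypothetical helper for illustration only; the agent itself uses the client_golang library to render this), each counter serializes as:

```go
package main

import "fmt"

// renderCounter emulates the Prometheus text exposition format shown above.
// This is a hand-rolled illustration, not the agent's actual code.
func renderCounter(name, help, labelKey, labelVal string, value float64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s{%s=%q} %g\n",
		name, help, name, name, labelKey, labelVal, value)
}

func main() {
	fmt.Print(renderCounter("ebpf_agent_hashmap_evictions",
		"Number of hashmap evictions", "operational", "hash map evictions", 16))
}
```

Running this prints the same three lines the curl output shows for ebpf_agent_hashmap_evictions.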

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Feb 7, 2024

@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Description

add promo metrics to eBPF agent with the ability to export metrics to promo Server

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@msherif1234 msherif1234 force-pushed the promo-stats branch 3 times, most recently from a5e0ae2 to 01f6dac Compare February 7, 2024 19:52

codecov bot commented Feb 7, 2024

Codecov Report

Attention: 62 lines in your changes are missing coverage. Please review.

Comparison is base (349fd30) 33.53% compared to head (77d43d1) 35.90%.

Files Patch % Lines
pkg/agent/agent.go 39.53% 25 Missing and 1 partial ⚠️
pkg/prometheus/prom_server.go 70.58% 14 Missing and 6 partials ⚠️
pkg/metrics/metrics.go 91.20% 6 Missing and 2 partials ⚠️
pkg/ebpf/tracer.go 0.00% 4 Missing ⚠️
pkg/flow/tracer_ringbuf.go 66.66% 2 Missing ⚠️
pkg/exporter/grpc_proto.go 90.90% 1 Missing ⚠️
pkg/exporter/kafka_proto.go 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #263      +/-   ##
==========================================
+ Coverage   33.53%   35.90%   +2.36%     
==========================================
  Files          40       42       +2     
  Lines        3554     3777     +223     
==========================================
+ Hits         1192     1356     +164     
- Misses       2293     2343      +50     
- Partials       69       78       +9     
Flag Coverage Δ
unittests 35.90% <75.20%> (+2.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


@msherif1234 msherif1234 changed the title NETOBSERV-557: add eBPF agent metrics for troubleshooting WIP: NETOBSERV-557: add eBPF agent metrics for troubleshooting Feb 7, 2024
@msherif1234 msherif1234 force-pushed the promo-stats branch 3 times, most recently from 7a12625 to 05e4311 Compare February 8, 2024 14:24
@msherif1234 msherif1234 changed the title WIP: NETOBSERV-557: add eBPF agent metrics for troubleshooting NETOBSERV-557: add eBPF agent metrics for troubleshooting Feb 8, 2024



@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 19, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:1ed6690

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1ed6690 make set-agent-image

Comment on lines 168 to 173
// MetricsPromoEnable enables prometheus server to collect ebpf agent metrics, default is false.
MetricsPromoEnable bool `env:"METRICS_PROMO_ENABLE" envDefault:"false"`
// MetricsPromoServerAddress is the address of the prometheus server that collects ebpf agent metrics.
MetricsPromoServerAddress string `env:"METRICS_PROMO_SERVER_ADDRESS"`
// MetricsPromoPort is the port of the prometheus server that collects ebpf agent metrics.
MetricsPromoPort int `env:"METRICS_PROMO_PORT" envDefault:"9090"`
Member

I'm curious about this "Promo" in the name; I guess it stands for "prometheus", but it's the first time I've seen it abbreviated like that. Anyway, I'm not sure we need to mention Prometheus in the variable names, as it is the de-facto standard in k8s anyway. Can we just say "MetricsEnable" (or "EnableMetrics", to be more consistent with the other EnableSomething in the settings?), MetricsServerAddress, etc.?

// MetricsPrefix is the prefix of the metrics that are sent to prometheus server.
MetricsPrefix string `env:"METRICS_PREFIX" envDefault:"ebpf_agent_"`
// MetricsNoPanic disables panic on metrics errors, default is false.
MetricsNoPanic bool `env:"METRICS_NO_PANIC" envDefault:"false"`
Member

Maybe we can remove the NoPanic option, not sure if it really brings something useful?
(In FLP we have this "NoPanic" option mostly because the code was initially written with panics and we didn't want that in the operator ... there's not the same background here)

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 19, 2024
hostPort: hostPort,
clientConn: clientConn,
maxFlowsPerMessage: maxFlowsPerMessage,
numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"),
Member

@jotak jotak Feb 19, 2024

What about a counter named records_written_total, shared across exporters, with a label such as transport or exporter that could be either grpc or kafka or anything else?
Also it may be useful to count records but also the number of batches, so there could be a second metric batches_written_total with again an exporter or transport label.

On naming metrics and labels, it's recommended to read this: https://prometheus.io/docs/practices/naming/
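The suggestion above is one shared counter with a "transport" label instead of one metric per exporter. A minimal sketch of that shape (the metric name records_written_total and the label values come from the suggestion, not from merged code; counterVec is a stand-in for a real prometheus.CounterVec):

```go
package main

import "fmt"

// counterVec is a hand-rolled stand-in for a labeled Prometheus counter:
// one metric name, one value per label value.
type counterVec struct {
	name   string
	values map[string]float64
}

// Add increments the counter for the given transport label value.
func (c *counterVec) Add(transport string, n float64) { c.values[transport] += n }

func main() {
	records := &counterVec{name: "records_written_total", values: map[string]float64{}}
	records.Add("grpc", 41)  // the gRPC exporter reports its writes here...
	records.Add("kafka", 10) // ...and so does the Kafka exporter, under its own label
	fmt.Println(records.values["grpc"], records.values["kafka"])
}
```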

Contributor Author

It doesn't seem there is a way to find out how many batches have been written; it's just a single send call with all records?

Member

At line 52, just after for inputRecords := range input {, if you increment a batch counter by 1, wouldn't that give the batch count?
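The batch-counting idea can be sketched like this: one increment per receive on the input channel gives the batch count, while summing batch sizes gives the record count (the []int element type stands in for the agent's record batches; export is a hypothetical helper):

```go
package main

import "fmt"

// export drains the input channel, counting one batch per channel receive
// and accumulating the number of records across batches.
func export(input <-chan []int) (batches, records int) {
	for inputRecords := range input {
		batches++                    // one batch per receive
		records += len(inputRecords) // records within this batch
	}
	return
}

func main() {
	ch := make(chan []int, 2)
	ch <- []int{1, 2, 3}
	ch <- []int{4}
	close(ch)
	fmt.Println(export(ch))
}
```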

clientConn: clientConn,
maxFlowsPerMessage: maxFlowsPerMessage,
numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"),
errExportByGRPC: m.CreateErrorCanNotWriteToGRPC("err_export_by_grpc"),
Member

Here also I think we could have a single metric, with the appropriate label, for reporting errors, rather than one different metric per exporter

rbTracer := flow.NewRingBufTracer(fetcher, mapTracer, cfg.CacheActiveTimeout)
m := metrics.NewMetrics(metricsSettings)
samplingGauge := m.CreateSamplingRate("sampling rate")
samplingGauge.Add(float64(cfg.Sampling))
Member

although it is technically the same since the gauge is initialized at 0, I would use .Set rather than .Add, as we are setting a value regardless what was the previous value.

Suggested change
samplingGauge.Add(float64(cfg.Sampling))
samplingGauge.Set(float64(cfg.Sampling))
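The distinction behind this suggestion: Add is relative to the previous gauge value, while Set is absolute. A hand-rolled gauge for illustration (a real prometheus.Gauge has the same Add/Set semantics):

```go
package main

import "fmt"

// gauge is a minimal stand-in for prometheus.Gauge.
type gauge struct{ v float64 }

func (g *gauge) Add(d float64) { g.v += d } // relative to previous value
func (g *gauge) Set(d float64) { g.v = d }  // absolute

func main() {
	var g gauge
	g.Add(50) // only correct because the gauge starts at 0
	g.Set(50) // states intent: the sampling rate *is* 50
	fmt.Println(g.v)
}
```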

return c
}

func (m *Metrics) CreateHashMapCounter(stage string) prometheus.Counter {
Member

In all the helper functions here, the "stage" param I think can be removed, that's something more relevant to FLP (because in FLP we configure custom pipelines, each with a stage name, so it was relevant to tie metrics to their stage)

evictionCond *sync.Cond
lastEvictionNs uint64
hmapEvictionCounter prometheus.Counter
numberOfEvictedFlows prometheus.Gauge
Member

I'm not sure why this one is a gauge; it isn't something that varies up and down, right? Looking at the code it only adds, so it should rather be a counter.

TypeCounter,
"operational",
)
timeSpentInLookupandDeleteMapSecondsTotal = defineMetric(
Member

name suggestion: lookup_and_delete_map_duration_seconds
(we don't use the "total" suffix here since we are not adding measurements)
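The suggested duration histogram works on cumulative "le" buckets: each observation increments every bucket whose bound is >= the observed value. A sketch with the same bucket bounds as the output in the PR description (illustration only, not the agent's implementation, which would use prometheus.NewHistogram):

```go
package main

import (
	"fmt"
	"sort"
)

// bounds mirrors the le bucket bounds shown in the PR's metric output.
var bounds = []float64{0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000}

// observe increments every cumulative bucket whose bound is >= seconds.
func observe(buckets []uint64, seconds float64) {
	i := sort.SearchFloat64s(bounds, seconds) // first bound >= seconds
	for ; i < len(buckets); i++ {
		buckets[i]++
	}
}

func main() {
	buckets := make([]uint64, len(bounds))
	observe(buckets, 0.0005) // a fast lookup+delete pass
	observe(buckets, 0.005)  // a slower one
	fmt.Println(buckets) // le="0.001" sees 1 observation, all larger buckets see 2
}
```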

"hashmap_evictions_total",
"Number of hashmap evictions total",
TypeCounter,
"operational",
Member

I think here and below "operational" is also a FLP leftover that can be removed

Contributor Author

removed

@jotak
Copy link
Member

jotak commented Feb 19, 2024

[attached screenshot]

p99 of lookup&delete map is kinda scary... almost 1s

@msherif1234
Contributor Author

[attached screenshot]

p99 of lookup&delete map is kinda scary... almost 1s

Is this at scale? We know this path is the busiest path and a resource hog in the agent.

}

var (
hmapEvictionsTotal = defineMetric(
Member

I have the feeling that we could merge metrics used for ringbuf with ones used for maps.
E.g. we could have:

  • evictions_total{source=bpf_ringbuf | bpf_maps}
  • evicted_flows_total{source=bpf_ringbuf | bpf_maps}

That would replace hmapEvictionsTotal, userspaceNumberOfEvictionsTotal, numberOfevictedFlowsTotal and numberofFlowsreceivedviaRingBufferTotal
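The merge proposed above can be sketched as two counters keyed by a "source" label: evictions_total and evicted_flows_total with source = bpf_maps | bpf_ringbuf. Plain maps stand in for labeled Prometheus counters, and record is a hypothetical helper:

```go
package main

import "fmt"

// record bumps both suggested metrics for one eviction event:
// evictions_total{source=...} by one, evicted_flows_total{source=...}
// by the number of flows flushed.
func record(evictions, evictedFlows map[string]float64, source string, flows float64) {
	evictions[source]++
	evictedFlows[source] += flows
}

func main() {
	evictions := map[string]float64{}
	evictedFlows := map[string]float64{}
	record(evictions, evictedFlows, "bpf_maps", 41)   // timer-driven hashmap flush
	record(evictions, evictedFlows, "bpf_ringbuf", 1) // map-full, single flow
	fmt.Println(evictions["bpf_maps"], evictedFlows["bpf_maps"])
}
```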

Member

Also, is it possible to get a "reason" label here that could be for instance "timeout" or "batch full" ? (or anything else that can trigger an eviction)

Contributor Author

The ring buffer is used when the map is full, which shouldn't happen that often. From a debugging point of view, I think you need to see both, so you can use this as a hint to resize your hashmap table; if you merge them you lose that visibility.
I will check for an eviction reason.

Contributor Author

From looking at the code, there is either a timer eviction for the hashmap or an event for the ring buffer; there aren't different reasons for evictions.

Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
return httpServer
}

func defaultServer(srv *http.Server) *http.Server {
Member

@jotak jotak Feb 21, 2024

I wonder if we should come up with a common lib in netobserv org for this kind of code... the same code is used in console plugin, FLP and now here.
Anyway, not for this PR

Member

@jotak jotak left a comment

lgtm
I will play more with that, create a dashboard etc. so perhaps will have further changes or additions to bring, but let's start from here and iterate if needed

thanks @msherif1234 !

@msherif1234
Contributor Author

/approve


openshift-ci bot commented Feb 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit e3dcdc6 into netobserv:main Feb 21, 2024
9 checks passed