
NETOBSERV-1131: Filter for Duplicate=false in metrics #387

Merged: 2 commits into netobserv:main on Jul 10, 2023

Conversation

@jotak (Member) commented on Jun 29, 2023

Requires FLP PR: netobserv/flowlogs-pipeline#448

Some observations from comparing byte rates in the plugin versus Prometheus metrics, e.g. between two nodes, with a PromQL query such as:

sum(rate(netobserv_node_ingress_bytes_total{DstK8S_HostName="ip-10-0-137-27.eu-west-3.compute.internal"}[90s])) by (SrcK8S_HostName, DstK8S_HostName)

On the plugin side, similarly: filtering on destination node name="ip-10-0-137-27.eu-west-3.compute.internal", using Scope=Node and observing the top 5 flow rates in the Overview:

Before this PR, values on the Prometheus side were inflated by anywhere from +20% to +80%.
After this PR, values on the Prometheus side are still higher, but much more reasonable: from +10% to +25%.
(data)

I don't have a definitive explanation for why I still see higher values, but my hypothesis is that they come from differences in rate calculation between Loki and Prometheus. Also, the metrics view in the console provides better resolution than our Overview charts, which might contribute to the discrepancies. In any case, this PR brings values back into a much more acceptable range.
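
For context, as the title says, the change filters the generated metrics on Duplicate=false, so flow records marked as duplicates are no longer counted. A minimal sketch of what such a filtered flowlogs-pipeline metric definition could look like, assuming the key/value filter syntax from netobserv/flowlogs-pipeline#448 (the field names below are illustrative, not the exact generated config):

# Illustrative FLP metric definition; the filter field names are assumptions
- name: node_ingress_bytes_total
  type: counter
  valueKey: Bytes
  labels:
    - SrcK8S_HostName
    - DstK8S_HostName
  filter:
    key: Duplicate
    value: "false"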

@openshift-ci-robot (Collaborator) commented on Jun 29, 2023

@jotak: This pull request references NETOBSERV-1131 which is a valid jira issue.

In response to this: (quotes the PR description verbatim; see above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov bot commented on Jun 29, 2023

Codecov Report

Merging #387 (c38b09d) into main (57a6951) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #387   +/-   ##
=======================================
  Coverage   53.72%   53.72%           
=======================================
  Files          45       45           
  Lines        5515     5515           
=======================================
  Hits         2963     2963           
  Misses       2342     2342           
  Partials      210      210           
Flag        Coverage Δ
unittests   53.72% <ø> (ø)

Flags with carried forward coverage won't be shown.

@jpinsonneau (Contributor) previously approved these changes on Jul 3, 2023, leaving a comment:

LGTM, thanks @jotak

Is it worth having an extra metric from namespace_flows_total with the duplicate: false filter?

@jotak (Member, Author) commented on Jul 3, 2023

Is it worth having an extra metric from namespace_flows_total with the duplicate: false filter?

That raises an interesting point. Personally, I see this metric more as a health metric than a network-insight metric, because it tells about flows rather than bytes or packets, and flow counts are a measure arbitrary to netobserv. If it's there for health reasons, filtering out duplicates doesn't make sense IMO: it's an indicator of the load that the netobserv components deal with, answering the question "which namespaces are most responsible for what netobserv is enduring?". Removing duplicates would hide some of that load.

That said, it's part of our NetObserv dashboard, not of NetObserv Health, which is perhaps a mistake?
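
To illustrate that health reading, a query such as the one below (a sketch only; it assumes netobserv_namespace_flows_total carries source/destination namespace labels analogous to the node metric above) would surface the namespaces producing the most flow records, duplicates included:

topk(5, sum(rate(netobserv_namespace_flows_total[2m])) by (SrcK8S_Namespace, DstK8S_Namespace))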

KalmanMeth previously approved these changes on Jul 5, 2023.
@jotak dismissed the stale reviews from KalmanMeth and jpinsonneau via c38b09d on Jul 5, 2023, 07:48.
The openshift-ci bot removed, then re-added, the lgtm label on Jul 5, 2023.
@memodi (Contributor) commented on Jul 7, 2023

/ok-to-test

The openshift-ci bot added the ok-to-test label (to set manually when a PR is safe to test; triggers an image build on the PR) on Jul 7, 2023.
@github-actions bot commented on Jul 7, 2023

New images:

  • quay.io/netobserv/network-observability-operator:06c9459
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-06c9459
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-06c9459

They will expire after two weeks.

Catalog source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-06c9459
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
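
To try these images, one would typically save the snippet above to a file and apply it with oc apply -f catalog-source.yaml (the file name is arbitrary); the operator then becomes installable from the "NetObserv development catalog" entry in OperatorHub.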

@memodi (Contributor) commented on Jul 7, 2023

/label qe-approved

The openshift-ci bot added the qe-approved label (QE has approved this pull request) on Jul 7, 2023.
@jotak (Member, Author) commented on Jul 10, 2023

/approve

@openshift-ci bot commented on Jul 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit baa4259 into netobserv:main on Jul 10, 2023. 11 checks passed.
Labels: approved, jira/valid-reference, lgtm, ok-to-test, qe-approved

6 participants