Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1099: remove reporter option #311

Merged
merged 4 commits into from Aug 3, 2023
Merged

Conversation

jotak
Copy link
Member

@jotak jotak commented Mar 23, 2023

  • Add a new reporter option to merge reported flows (in fact it is a disguised "both", and just filter out one of the reporters per src/dest key)
  • Add test

(outdated capture)
Capture d’écran du 2023-03-23 12-25-15

New capture:
image

@jotak
Copy link
Member Author

jotak commented Mar 23, 2023

Opening as draft for now ; it may need some discussion.

  • At the moment I'm playing conservative, keeping all the previous reporter options, even if ultimately we might end up with just this one
  • It works for the flow table, but not for metrics, so we can't entirely get rid of this setting anyway
  • It doesn't solve everything wrt return traffic, I opened a work document for the remaining problems: https://docs.google.com/document/d/14ptPh60WtZavJ2q22BpHENibvhdAD4fLRRjBsHZBVjY/edit

Also, as this is kind of a new feature, I suggest to hold for 1.2 and merge that only for 1.3

@codecov
Copy link

codecov bot commented Mar 23, 2023

Codecov Report

Merging #311 (438cbed) into main (13e24e4) will increase coverage by 0.21%.
The diff coverage is 72.19%.

@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
+ Coverage   57.15%   57.37%   +0.21%     
==========================================
  Files         163      166       +3     
  Lines        7543     7706     +163     
  Branches      921      909      -12     
==========================================
+ Hits         4311     4421     +110     
- Misses       2967     3019      +52     
- Partials      265      266       +1     
Flag Coverage Δ
uitests 58.35% <61.40%> (+0.32%) ⬆️
unittests 54.69% <82.16%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
pkg/handler/loki.go 42.02% <ø> (ø)
pkg/model/fields/fields.go 100.00% <ø> (ø)
web/src/components/metrics/histogram.tsx 16.51% <ø> (ø)
web/src/components/metrics/latency-donut.tsx 29.62% <ø> (ø)
...omponents/metrics/single-metrics-total-content.tsx 29.03% <ø> (ø)
web/src/components/modals/export-modal.tsx 52.23% <ø> (ø)
...c/components/netflow-overview/netflow-overview.tsx 65.62% <ø> (ø)
...omponents/netflow-topology/2d/topology-content.tsx 28.42% <ø> (ø)
...s/netflow-topology/3d/three-d-topology-content.tsx 8.82% <ø> (ø)
...omponents/netflow-topology/element-panel-stats.tsx 100.00% <ø> (ø)
... and 26 more

... and 1 file with indirect coverage changes

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 16, 2023

@jotak: This pull request references NETOBSERV-1099 which is a valid jira issue.

In response to this:

  • Add a new reporter option to merge reported flows (in fact it is a disguised "both", and just filter out one of the reporters per src/dest key)
  • Add test

Capture d’écran du 2023-03-23 12-25-15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jotak
Copy link
Member Author

jotak commented Jun 16, 2023

I revisited this old PR, I guess we could take a stab at it

@jotak jotak marked this pull request as ready for review June 16, 2023 09:06
@jotak
Copy link
Member Author

jotak commented Jun 16, 2023

making 'merge' as the new default reporter is a bit tricky since it's only for table view, so we need to deal with 2 defaults, and being able to switch to default when user changes tabs UNLESS they explicitly choosed Source or Destination from the overview/topology tab, in which case probably we want to keep that selection when moving to Table view ... 😵‍💫

or ... more simply, keep reporter-for-table and reporter-for-metrics as two distinct states - hence it wouldn't be applied when moving from Topo/Overview to Table or vice-versa

If someone as a better suggestion, speak :-)

@jpinsonneau
Copy link
Contributor

What is missing to make 'merged' works with metrics ?
Can't we group by conversationId ? (Maybe need some improvements first ?)

As a user POV, I feel 'merged' should be the unique way. If you are interested in particular source or destination, just apply any filter.

@jotak
Copy link
Member Author

jotak commented Jun 19, 2023

What is missing to make 'merged' works with metrics ? Can't we group by conversationId ? (Maybe need some improvements first ?)

I'm afraid this just won't be possible with the data model that we have: segregating flows by source and dest IP is necessary for merging. (Like done here). Metrics are aggregated e.g. per namespaces, this isn't a sufficient level of details to dedup.

To take an example, imagine these flows:

F1: ingress ([ns=A, pod ip=1.2.3.4] => [ns=B, pod ip=1.2.3.5])
F2: egress ([ns=A, pod ip=1.2.3.4] => [ns=B, pod ip=1.2.3.5])
F3: egress ([ns=A, pod ip=1.2.3.4] => [ns=B, svc ip=10.20.30.50])

Here, flows 1 and 2 are the same and will be deduped, by taking only the "ingress" ones (first to come). Flow 3 has a different dest IP so it will be kept as well. So F1 and F3 remain.

With metrics, such as aggregated by namespace, you'd get:

M1: ingress ([ns=A] => [ns=B]) <=> F1
M2: egress ([ns=A] => [ns=B]) <=> F2+F3

Applying the same logic would keep only M1, which actually doesn't integrate F3 values, so the metrics would be incorrect.

=> you cannot just pick one and forget the other. The Ingress metric would miss some flows that are only in Egress, and vice-versa.

As a user POV, I feel 'merged' should be the unique way. If you are interested in particular source or destination, just apply any filter.

I agree, except perhaps we'd also want to be able to see BOTH (unmerged)

@jotak
Copy link
Member Author

jotak commented Jun 19, 2023

@jpinsonneau I don't see an easy way to solve that. Dedup done from the agents or FLP are limited because they don't have the full-nodes vision. Dedup done from the UI / loki queries, as in this PR, work only for raw flows, not for aggregated metrics. Perhaps that could be done from an intermediate job, able to have the full-nodes vision AND to rewrite flows by editing the "dedup" label. But that would increase the overall complexity of netobserv, not talking about performances, so not sure if it's worth it.

@jotak
Copy link
Member Author

jotak commented Jun 19, 2023

(thinking out loud)
maybe for metrics we could use a different algorithm perhaps less generic (it does more assumptions) :
Taking for instance all EGRESS + all INGRESS matching [src_name=unknown]
Assumptions being that:

  • In-cluster flows are duplicated between ingress and egress, so getting all EGRESS shows a correct picture of the in-cluster metrics
  • Cluster-egress flows are covered by EGRESS
  • Cluster-ingress is covered by all INGRESS matching [src_name=unknown]
  • There's no overlap between all EGRESS and all INGRESS matching [src_name=unknown]

The downside is, the more we do assumptions, the more "vulnerable" we are to CNI specific behaviours.

@jotak
Copy link
Member Author

jotak commented Jun 19, 2023

I think this is going well: so I ended up removing entirely "Reporter", trading it for a simpler approach of "show duplicates: yes/no":

image

In the table view, this option allows to hide all duplicate traffic; two algorithms come into play:

  • The existing "Duplicate" field that is inserted by the agent, by filtering with "Duplicate: false".
  • Since this field doesn't account for duplication across nodes, the new "merge" algorithm adds dedup across nodes.

In the topology/metrics views, this option is not available (flows are always deduped - to not mess up with the rate counters); however the solution to dedup differs since the algorithm used in the table cannot be used here, as it requires a higher level of details (with IP) that aggregate queries don't provide. There's still two algorithms in play:

  • Again, the existing "Duplicate" field that is inserted by the agent, by filtering with "Duplicate: false" - this was already active before this PR
  • Another one that plays with FlowDirection to aggregate EGRESS and INGRESS without overlap.

Users who want a specific reporter / flow direction can still use the appropriate filter

Comment on lines 66 to 78
// Merging is done by running a first query with FlowDirection=EGRESS and another with FlowDirection=INGRESS AND SrcOwnerName is empty.
// (Note that we use SrcOwnerName both as an optimization as it's a Loki index,
// and as convenience because looking for empty fields won't work if they aren't indexed)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpinsonneau I'm not totally sure if this would work with conversation or if we need to skip - or do something else - in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm never mind - topology & metrics always use raw flows, not conversations, and this code is only called from topology endpoint ... so that's fine I guess

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics queries can be based on conversations when available.
As Both FlowDirection and SrcOwnerName fields are available in conversations, the approch is fine.
However:

  • FlowDirection is not reinterpreted in conversations
  • SrcOwnerName can actually be flipped by the swapAB option depending on TCP flags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so .. this brings us back to swapAB needing to switch direction.. I'll open a separate bug about that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jotak jotak changed the title NETOBSERV-1099: new "merge" reporter option NETOBSERV-1099: remove reporter option Jul 5, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jul 5, 2023

@jotak: This pull request references NETOBSERV-1099 which is a valid jira issue.

In response to this:

  • Add a new reporter option to merge reported flows (in fact it is a disguised "both", and just filter out one of the reporters per src/dest key)
  • Add test

(outdated capture)
Capture d’écran du 2023-03-23 12-25-15

New capture:
image

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 27, 2023
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 27, 2023
@github-actions
Copy link

New image:
quay.io/netobserv/network-observability-console-plugin:5aad792

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=5aad792 make set-plugin-image

"Every flow can be reported from the source node and/or the destination node. For in-cluster traffic, usually both source and destination nodes report flows, resulting in duplicated data. Cluster ingress traffic is only reported by destination nodes, and cluster egress by source nodes.": "Every flow can be reported from the source node and/or the destination node. For in-cluster traffic, usually both source and destination nodes report flows, resulting in duplicated data. Cluster ingress traffic is only reported by destination nodes, and cluster egress by source nodes.",
"Reporter node": "Reporter node",
"Only available in Table view.": "Only available in Table view.",
"A flow might be reported from several interfaces, and from both the source and the destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.": "A flow might be reported from several interfaces, and from both the source and the destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"A flow might be reported from several interfaces, and from both the source and the destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.": "A flow might be reported from several interfaces, and from both the source and the destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.",
"A flow might be reported from several interfaces, and from both source and destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.": "A flow might be reported from several interfaces, and from both source and destination nodes, making it appear several times. By default, duplicates are hidden. Showing duplicates is not possible in Overview and Topology tabs.",

jpinsonneau
jpinsonneau previously approved these changes Jul 28, 2023
Copy link
Contributor

@jpinsonneau jpinsonneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jotak, LGTM in terms of code, just a small change in the tooltip

Not tested yet; but let me know if you need me to test some corner cases on that with you @memodi 😸

@memodi
Copy link
Contributor

memodi commented Jul 28, 2023

Not tested yet; but let me know if you need me to test some corner cases on that with you @memodi 😸

I'd appreciate that :D

@openshift-ci
Copy link

openshift-ci bot commented Jul 31, 2023

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm label Jul 31, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 31, 2023
@memodi
Copy link
Contributor

memodi commented Aug 2, 2023

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Aug 2, 2023
@github-actions
Copy link

github-actions bot commented Aug 2, 2023

New image:
quay.io/netobserv/network-observability-console-plugin:d6b7cbf

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=d6b7cbf make set-plugin-image

@memodi
Copy link
Contributor

memodi commented Aug 2, 2023

@jotak - In terms of testing this, here I made sure by default only "Duplicate: false" show in table view and with check box both kind show and checkbox is disabled for both overview/topology page. Let me know if there's any other scenarios to test here, otherwise consider this qe-approved and feel free to merge.

@jotak
Copy link
Member Author

jotak commented Aug 3, 2023

Thanks @memodi
Maybe what could be tested in addition, is that the scenarios such as described in NETOBSERV-696 and NETOBSERV-697 are now 100% good. Both this PR and the bnf work were done with that in mind, to fix some corner cases when observing traffic between two identified peers - especially when one of them is cluster external. Cf for instance the description in NETOBSERV-696 that provides reproducible steps.

EDIT: to be more accurate: the scenario that this is fixing is described in NETOBSERV-697 ("A possible solution is to get all flows and deduplicate. Ideally, information on what nodes the flows have traversed should be preserved. If done right, the Reporter node option can be removed.") ; but for testing you can run the steps described in NETOBSERV-696 :-)

EDIT 2: Also as I wrote on slack, you saw another limitation of the unmerged reporters when looking at traffic to services - merged reporters solve that, ie. you should see traffic from/to services without having to switch between ingress/egress

@jpinsonneau
Copy link
Contributor

jpinsonneau commented Aug 3, 2023

The average metrics seems to be different between main (in dark theme on the left) and your PR (light theme on the right)
image
image
=> oh ok it seems it uses reporter Source on metrics now. By default we used Destination that always catch more. I think it would be better to keep this behavior here. @jotak WDYT ?

The tables looks good using reporter both / showing duplicates
image
I'm still investigating on these 🔬
=> see mergeFlowReporters comment

Comment on lines +4 to +12
export const mergeFlowReporters = (flows: Record[]): Record[] => {
// The purpose of this function is to determine if, for a given [srcip, dstip] couple, we'll look at INGRESS or EGRESS reporter
// The assumption is that INGRESS alone, or EGRESS alone, or both of them, always provide a complete visiblity, so we can just pick one of the two.
// The logic if then to index flows by src+dest ips, then for each indexed set, keep only the first-found reporter
const keyFunc = (r: Record) => r.fields.SrcAddr + '+' + r.fields.DstAddr;
const grouped = _.groupBy(flows, keyFunc);
const filtersIndex = _.mapValues(grouped, (recs: Record[]) => recs[0].labels.FlowDirection);
return flows.filter((r: Record) => r.labels.FlowDirection === filtersIndex[keyFunc(r)]);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts about this functions. INGESS or EGRESS alone doesn't report the same amount of packets / bytes

You see the difference by taking the last found reporter for example recs[recs.length -1]

Last reporter:
image

First reporter:
image

All:
image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open a ticket and I will investigate post-merge? There might be some discrepancies related to having sampling, but I need to confirm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Add a new reporter option to merge reported flows (in fact, just
  filter out one of the reporters per src/dest key)
- Add test

Replace "reporter" setting with "show duplicates"

- In table view, instead of choosing a reporter (or both / merged)
the user can show duplicates ( = previously reporter=both)
or hide them ( = the new "merged" reporter + filtering with Duplicate=false)
- In metrics, showing duplicates is not allowed
- If the user wants so, it's still possible to filter with FlowDirection
using the regular filters
- In metrics, the dedup algorithm differs from the table view:
it's a specific query to get EGRESS + cluster-external INGRESS

Split API calls for topology and drops

As they result in quite different data model and processing, explicit
distinct routes allows to better separate concerns and avoid frequent
casts.

Issues with merge reporter for drops

Fetch Ingress in priority, seems to work better, however there might be
more to do as the assumption done for merged reporters doesn't seem
relevant for drops

Improve input validation

Check more query parameters, especially those that end up injected such
as rate interval

Fix: no dedup for drops

Add validation tests
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Aug 3, 2023
@jotak jotak merged commit dc83aa0 into netobserv:main Aug 3, 2023
6 of 7 checks passed
@memodi
Copy link
Contributor

memodi commented Aug 3, 2023

Maybe what could be tested in addition, is that the scenarios such as described in NETOBSERV-696 and NETOBSERV-697 are now 100% good.

ack, thanks @jotak , I will try these

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants