Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1064: TCP drop & DNS tracking #324

Closed
wants to merge 28 commits into from

Conversation

jpinsonneau
Copy link
Contributor

@jpinsonneau jpinsonneau commented Apr 28, 2023

Adding packet loss query options as:

  • Dropped: show records fully dropped packets
  • Contains drops: show records that contains drops but may be sent
  • Sent: show records that contains sent packets
  • All: show all records

image

Also add drops count in Packets & Bytes columns:
image
Side panel detailed view:
image

Full page:
image

Topology redlines:
image

Overview new graphs:
image
image
image

Adding DNS request / response / latency:
image

Summary panel:
image

The console plugin check for features: ['tcpDrop', 'dnsTracking'] in its ConfigMap to show / hide the additionnal panels / columns.

ToDo:

  • Dark theme
  • Topology lines color
  • Topology selection actually works fine on 4.13.4
  • TcpDropState graph
  • TcpDropCause graph
  • Exclude "0 = empty dropped" state / "not dropped yet" cause from queries Replaced by full text

You can use make start-standalone-mock to try this PR without running related ones:
netobserv/netobserv-ebpf-agent#115

@eranra
Copy link
Contributor

eranra commented May 2, 2023

@jpinsonneau Looks good
(1) can we agree to "remove" the entries when there is N/A --- I do not see a logic to showing red-colored N/A values
(2) Looking into the UI screenshot - I see partially dropped packets in some entries --- is that correct? Do we drop parts of the data from a flow? This is interesting ( maybe this is due to sampling? and if so, this will cause a lot of questions ... is the data correct )

@jpinsonneau
Copy link
Contributor Author

@jpinsonneau Looks good (1) can we agree to "remove" the entries when there is N/A --- I do not see a logic to showing red-colored N/A values (2) Looking into the UI screenshot - I see partially dropped packets in some entries --- is that correct? Do we drop parts of the data from a flow? This is interesting ( maybe this is due to sampling? and if so, this will cause a lot of questions ... is the data correct )

Hey Eran, thanks for taking a look to this PoC 👍

I can hide the n/a for sure. I just wanted a way to distinguish empty values to zeros.

This was only test data so I'm not sure we will have partial drops yet. I feel it's possible when retry occurs ? We should see that in netobserv/network-observability-operator#331

@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 2, 2023
@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Merging #324 (3cf1597) into main (38f4cfc) will decrease coverage by 0.33%.
The diff coverage is 54.19%.

❗ Current head 3cf1597 differs from pull request most recent head 13c7909. Consider uploading reports for the commit 13c7909 to get more accurate results

@@            Coverage Diff             @@
##             main     #324      +/-   ##
==========================================
- Coverage   57.92%   57.60%   -0.33%     
==========================================
  Files         150      153       +3     
  Lines        6665     7019     +354     
  Branches      797      852      +55     
==========================================
+ Hits         3861     4043     +182     
- Misses       2588     2742     +154     
- Partials      216      234      +18     
Flag Coverage Δ
uitests 59.19% <64.96%> (+0.35%) ⬆️
unittests 53.12% <32.65%> (-2.27%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/handler/lokiclientmock/loki_client_mock.go 0.00% <0.00%> (ø)
pkg/model/fields/fields.go 100.00% <ø> (ø)
web/src/api/ipfix.ts 100.00% <ø> (ø)
web/src/api/loki.ts 100.00% <ø> (ø)
web/src/components/__tests-data__/config.ts 100.00% <ø> (ø)
...components/dropdowns/overview-display-dropdown.tsx 68.42% <ø> (ø)
web/src/components/dropdowns/scope-dropdown.tsx 31.81% <ø> (ø)
...components/dropdowns/topology-display-dropdown.tsx 47.36% <ø> (ø)
web/src/components/messages/loki-error.tsx 24.28% <ø> (ø)
web/src/components/metrics/histogram.tsx 16.51% <ø> (ø)
... and 38 more

... and 2 files with indirect coverage changes

@github-actions
Copy link

github-actions bot commented May 2, 2023

New image: ["quay.io/netobserv/network-observability-console-plugin:793cb4f"]. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 2, 2023
@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 23, 2023
@github-actions
Copy link

New image: quay.io/netobserv/network-observability-console-plugin:f627dbf. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 26, 2023
@msherif1234 msherif1234 changed the title NETOBSERV-979 PoC packet loss NETOBSERV-1064: PoC packet loss May 26, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented May 26, 2023

@jpinsonneau: This pull request references NETOBSERV-1064 which is a valid jira issue.

In response to this:

Adding packet loss query options as:

  • Dropped: show records fully dropped packets
  • Contains drops: show records that contains drops but may be sent
  • Sent: show records that contains sent packets
  • All: show all records

image

Also add drops count in Packets & Bytes columns:
image
Side panel detailed view:
image

Full page:
image

Topology redlines:
image

Overview new graphs:
image
image
image

ToDo:

  • Dark theme
  • Topology lines color
  • Topology selection
  • TcpDropState graph
  • TcpDropCause graph
  • Exclude "0 = empty dropped" state / "not dropped yet" cause from queries

You can use make start-standalone-mock to try this PR without running related ones:
netobserv/netobserv-ebpf-agent#115

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jpinsonneau jpinsonneau changed the title NETOBSERV-1064: PoC packet loss NETOBSERV-1064: PoC TCP drop & DNS tracking Jun 6, 2023
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jun 6, 2023

@jpinsonneau: This pull request references NETOBSERV-1064 which is a valid jira issue.

In response to this:

Adding packet loss query options as:

  • Dropped: show records fully dropped packets
  • Contains drops: show records that contains drops but may be sent
  • Sent: show records that contains sent packets
  • All: show all records

image

Also add drops count in Packets & Bytes columns:
image
Side panel detailed view:
image

Full page:
image

Topology redlines:
image

Overview new graphs:
image
image
image

Adding DNS request / response / latency:
image

Summary panel:
image

ToDo:

  • Dark theme
  • Topology lines color
  • Topology selection
  • TcpDropState graph
  • TcpDropCause graph
  • Exclude "0 = empty dropped" state / "not dropped yet" cause from queries

You can use make start-standalone-mock to try this PR without running related ones:
netobserv/netobserv-ebpf-agent#115

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@github-actions
Copy link

github-actions bot commented Jun 6, 2023

New image: quay.io/netobserv/network-observability-console-plugin:f8c015c. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@github-actions
Copy link

github-actions bot commented Jun 6, 2023

New image: quay.io/netobserv/network-observability-console-plugin:05b8fa1. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@github-actions
Copy link

github-actions bot commented Jun 6, 2023

New image: quay.io/netobserv/network-observability-console-plugin:f48fbb3. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 6, 2023
@jpinsonneau
Copy link
Contributor Author

Fully retested in OCP console with addressed feedback:

  • updated descriptions
    • without drops (previously sent) filter now match on "TcpDropPackets":0
  • force strict label check
  • reuse lineFilters

image
image
image
image

@jpinsonneau jpinsonneau requested a review from jotak July 10, 2023 10:31
@@ -147,6 +147,9 @@ const getPeerName = (
if (name) {
return truncateParts(name, truncateLength);
}
if (scope === 'droppedCause' || scope === 'droppedState') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be for a follow-up PR: looks like these new values aren't really scopes, and since they need sometimes to be excluded, maybe we could compose types such as:

   type FlowScope = 'host' | ... // (usual flow scopes);
   type AggregateBy = FlowScope | 'droppedCause' | 'droppedState';

So we don't mix them too much which will avoid having to handle special cases like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we go: 826ac69

if (error) {
return <LokiError title={t('Unable to get overview')} error={error} />;
} else if (_.isEmpty(metrics) || !totalMetric) {
//TODO: manage each metrics loading state separately
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a task for that in tech-debt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


const timeContent = () => {
const filteredFlows = flows || [];
const duration =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a check is needed that filteredFlows.length is >0 (avoid div by 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we go: 3cf1597

@jotak
Copy link
Member

jotak commented Jul 11, 2023

Finished the code review ; very few comments / nothing big!

@jotak
Copy link
Member

jotak commented Jul 11, 2023

My main concern is for the whole DNS epic: I don't think it brings as much value as it could, because of the user workflow: having to find specific DNS request and response by playing with filters, before being eventually able to get the latency info, makes it only relevant for specific troubleshooting use cases and not for monitoring in a broader sense, and even then, the value added is limited as we can get the same info by looking at rq/rs flow timestamps.

I get that this is inherent to the data provided by the Agent, so maybe we need to discuss more about how we could enhance the experience (e.g. revisit how the agent provides the dns flows?)

This is not blocking this PR, but I think we should either find follow-ups to do as part of the DNS epic, or flag this epic as experimental, or something in that vein, if it's not fully fleshed and is required to be revisited.

Also, as follow-ups for DNS: should we do anything specific about the DNS flags? I see the field is added but nothing is done from it. E.g. filtering for DNS errors, showing errors in a specific way?

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 11, 2023
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 11, 2023
@github-actions
Copy link

New image: quay.io/netobserv/network-observability-console-plugin:ba451d6. It will expire after two weeks.

@jotak
Copy link
Member

jotak commented Jul 11, 2023

/lgtm
thanks @jpinsonneau !

@openshift-ci openshift-ci bot removed the lgtm label Jul 12, 2023
@openshift-ci
Copy link

openshift-ci bot commented Jul 12, 2023

New changes are detected. LGTM label has been removed.

@openshift-ci
Copy link

openshift-ci bot commented Jul 12, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 12, 2023
@jpinsonneau
Copy link
Contributor Author

/lgtm thanks @jpinsonneau !

@jotak looks like rebasing makes me forget about a flag to expose DNS Id filter:
13c7909
image

Have you tested it ?

@Amoghrd
Copy link
Contributor

Amoghrd commented Jul 12, 2023

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 12, 2023
@github-actions
Copy link

New image:
quay.io/netobserv/network-observability-console-plugin:7f56fc4

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=7f56fc4 make set-plugin-image

@msherif1234
Copy link
Contributor

#356 included this PR

@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jpinsonneau
Copy link
Contributor Author

Replaced by #356

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference needs-rebase ok-to-test To set manually when a PR is safe to test. Triggers image build on PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants