Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for sampling=1 bugs #132

Merged
merged 2 commits into from
Aug 2, 2022

Conversation

jotak
Copy link
Member

@jotak jotak commented Jul 12, 2022

By default, prevent setting sampling=1 for IPFIX (automatically raised
to 2 in that case)

Flag "forceSampleAll" can be use to force sampling=1 (used at their own risks)

See bugs:

Change ipfix cache defaults:

  • timeout decreased to 20s
  • max-flows increased to 400

@jotak
Copy link
Member Author

jotak commented Jul 12, 2022

cc @memodi

By default, prevent setting sampling=1 for IPFIX (automatically raised
to 2 in that case)
If flag "forceAllowSamplingAll" is set to true, the tule above is
ignored and user can still set it to 1 at their own risks

See bugs:
- https://bugzilla.redhat.com/show_bug.cgi?id=2103136
- https://bugzilla.redhat.com/show_bug.cgi?id=2104943

Change ipfix cache defaults:

- timeout decreased to 20s
- max-flows increased to 400
@mariomac
Copy link
Contributor

My main concern is that we are including a behavior that could be unnoticed by many users that do not completely read the documentation. Also, if the issue gets fixed and we remove this automatic fix in a future version, many users could see an unwanted/unexpected change in the behavior after they update (with implications e.g. in their internal network traffic or storage).

If we can't expect that the causing issue is fixed soon, I'd do: if sampling == 1, instead of silently set it to 2, return error and inform that sampling must be 2 or forceAllowSamplingAll property must be explicitly set to true at their own risk.

@jotak
Copy link
Member Author

jotak commented Jul 12, 2022

good point, yeah, probably better to return an error

@jotak jotak force-pushed the prevent-sampling-1 branch 2 times, most recently from 5130818 to 9f64b66 Compare July 13, 2022 08:48
It allows to inform the user of an issue as soon as they create or
update the custom resource

Field forceSampleAll allows to override the sampling definition
@jotak
Copy link
Member Author

jotak commented Jul 13, 2022

/retest

format: int32
minimum: 0
minimum: 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no issues on eBPF side so I would not force this minimum here.

Moreover, I would prefer a realistic minimum based on perf tests for each agent more than a workaround here. Thoughts ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sampling for ebpf and ipfix are two different configs, so no problem on that side.

I don't think we want to introduce minimums in general, ideally, users should be able to set whatever they want and be careful about what they do. But the issue we found is critical enough (like, put a cluster down) and happened quite consistently with ipfix' sampling=1, so that's why we take extra measures to prevent it to happen in that specific situation. The stability really isn't the same between sampling=1 and sampling=2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry I though it was a common field. I agree the user should understand what he does and not able to put the cluster down.

What about adding some numbered recommendations in the descriptions based on performance results as a followup task ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's something we need to document anyway. Captured here: https://issues.redhat.com/browse/NETOBSERV-493

@jotak
Copy link
Member Author

jotak commented Aug 2, 2022

I am going to merge this as I don't see moves on OVS side at the moment. We will rollback if it ends up being unnecessary. I'll also start releasing v0.1.4 candidate after that.

/approve

@openshift-ci
Copy link

openshift-ci bot commented Aug 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 2, 2022
@openshift-merge-robot openshift-merge-robot merged commit 048b825 into netobserv:main Aug 2, 2022
@jotak jotak deleted the prevent-sampling-1 branch December 7, 2022 04:03
KalmanMeth pushed a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023
fix manually location db and update automatic workflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants