
NETOBSERV-2358 exit on daemonset failure with logs#373

Merged
openshift-merge-bot[bot] merged 4 commits into netobserv:main from jpinsonneau:2358
Sep 17, 2025

Conversation

@jpinsonneau
Member

@jpinsonneau jpinsonneau commented Sep 1, 2025

Description

eBPF agent errors at startup are now displayed in the CLI.
Example on OCP 4.15 with the latest eBPF agent running a packet capture (unsupported kernel):

Checking dependencies... 
'yq' is up to date (version v4.45.1).
'bash' is up to date (version v5.2.37).
Setting up... 
kube:admin
creating netobserv-cli namespace
namespace/netobserv-cli created
creating service account
serviceaccount/netobserv-cli created
clusterrole.rbac.authorization.k8s.io/netobserv-cli unchanged
clusterrolebinding.rbac.authorization.k8s.io/netobserv-cli unchanged
creating collector service
service/collector created
creating packet-capture agents
opt: filter_protocol, value: TCP
daemonset.apps/netobserv-cli created
Waiting for daemonset pods to be ready...
0/3 Ready. Reason(s): CrashLoopBackOff

ERROR: Daemonset pods failed to start:
Found 3 pods, using pod/netobserv-cli-d7vqz
time="2025-09-01T16:55:28Z" level=fatal msg="[PCA] can't instantiate NetObserv eBPF Agent" error="loading and assigning BPF objects: field TcEgressPcaParse: program tc_egress_pca_parse: load program: permission denied: invalid access to memory, mem_size=272 off=16 size=0: R3 min value is outside of the allowed memory range (514 line(s) omitted)"

kube:admin
Copy skipped

Cleaning up...
Deleting service monitor... 
Deleting dashboard configmap... 
Deleting daemonset... daemonset.apps "netobserv-cli" deleted

Deleting pod... 
Deleting namespace... namespace "netobserv-cli" deleted
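
A minimal sketch of the behavior shown in the output above: collect the reasons why pods are not ready, then print the agent logs and fail. The helper names (`waiting_reasons`, `is_fatal_reason`, `fail_with_logs`), the list of fatal reasons, and the `kubectl` default for `K8S_CLI_BIN` are assumptions for illustration; they are not taken from this PR's diff.

```bash
# Assumption: the CLI binary defaults to kubectl when K8S_CLI_BIN is unset.
K8S_CLI_BIN="${K8S_CLI_BIN:-kubectl}"

# Print the distinct container waiting reasons (for example CrashLoopBackOff)
# of the pods matching a label selector, as a comma-separated list.
# $1 = namespace, $2 = label selector
waiting_reasons() {
  "${K8S_CLI_BIN}" get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | tr ' ' '\n' | sed '/^$/d' | sort -u | paste -sd, -
}

# Succeed (return 0) when the reason list contains a reason that will not
# resolve by waiting longer. The list of reasons is an illustrative choice.
is_fatal_reason() {
  case "$1" in
    *CrashLoopBackOff*|*ImagePullBackOff*|*ErrImagePull*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the last log lines of one daemonset pod and report failure.
# $1 = namespace, $2 = daemonset name
fail_with_logs() {
  echo "ERROR: Daemonset pods failed to start:"
  "${K8S_CLI_BIN}" logs -n "$1" "daemonset/$2" --tail=20 2>&1
  return 1
}

# Example wiring (the label selector is hypothetical):
#   reasons="$(waiting_reasons netobserv-cli app=netobserv-cli)"
#   if is_fatal_reason "$reasons"; then
#     fail_with_logs netobserv-cli netobserv-cli || exit 1
#   fi
```

Asking for logs by `daemonset/<name>` lets the CLI pick one pod on its own, which matches the "Found 3 pods, using pod/..." line in the output above.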

Since the error from the eBPF agent may not be very explicit, I have added a check per OCP version (when available):

Checking dependencies... 
'yq' is up to date (version v4.45.1).
'bash' is up to date (version v5.2.37).
Setting up... 
cluster-admin
OpenShift version: 4.12.0
- Network events requires OpenShift 4.19 or higher
- UDN mapping requires OpenShift 4.18 or higher
- Packet drops requires OpenShift 4.14 or higher
Remove not compatible features and try again
cluster-admin
Copy skipped

Cleaning up...
Deleting service monitor... 
Deleting dashboard configmap... 
Deleting daemonset... 
Deleting pod... 
Deleting namespace...
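
A rough sketch of how such a per-feature check could be expressed, using the minimum versions printed in the output above. The function names (`version_at_least`, `check_feature`, `check_features`) and the `"name:minimum"` argument format are hypothetical, and the comparison relies on `sort -V` (version sort), which is assumed to be available.

```bash
# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V`; the lower of the two versions is printed first.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Print a message and fail when the cluster version is below the minimum.
# $1 = cluster version, $2 = feature name, $3 = minimum version
check_feature() {
  if ! version_at_least "$1" "$3"; then
    echo "- $2 requires OpenShift $3 or higher"
    return 1
  fi
}

# Check every "feature name:minimum version" pair passed after the version.
check_features() {
  local version="$1" failed=0 spec
  shift
  for spec in "$@"; do
    check_feature "$version" "${spec%%:*}" "${spec##*:}" || failed=1
  done
  if [ "$failed" -eq 1 ]; then
    echo "Remove not compatible features and try again"
  fi
  return "$failed"
}

# Example:
#   check_features 4.12.0 "Network events:4.19" "UDN mapping:4.18" "Packet drops:4.14"
```

In practice only the features enabled on the command line would be passed in, so a capture that does not request network events would not be blocked on an older cluster.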

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@jpinsonneau jpinsonneau added the needs-review label Sep 2, 2025
}

function checkClusterVersion() {
version=$(${K8S_CLI_BIN} get clusterversion version -o jsonpath='{.status.history[*].version}')
Member


what does history[*] do exactly? Pick first/any? Just wondering if that works if the history contains old cluster versions ...

Member Author


[*] is the equivalent of map here.

If multiple versions are listed there, I will end up with an array in $version 🤔
I should probably lock that using the Completed state, as we do on the operator side.

Member Author


That will do the job: 651f944

If the version is not found, I will simply skip the check with a warning since the cluster is probably upgrading.
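
For illustration, one way to restrict the lookup to completed updates and to skip the check when nothing is found. This is a sketch under assumptions, not the content of commit 651f944: the function names are hypothetical, `K8S_CLI_BIN` is assumed to default to `kubectl`, and the first matching history entry is assumed to be the most recent one.

```bash
# Assumption: the CLI binary defaults to kubectl when K8S_CLI_BIN is unset.
K8S_CLI_BIN="${K8S_CLI_BIN:-kubectl}"

# Print the first version whose history entry is in the Completed state,
# or nothing when no such entry exists (for example during an upgrade).
completed_version() {
  "${K8S_CLI_BIN}" get clusterversion version \
    -o jsonpath='{.status.history[?(@.state=="Completed")].version}' 2>/dev/null \
    | awk '{ print $1 }'
}

# Report the detected version, or warn and carry on when none is found.
check_cluster_version() {
  local version
  version="$(completed_version)"
  if [ -z "$version" ]; then
    echo "WARNING: no Completed cluster version found, skipping version check"
    return 0
  fi
  echo "OpenShift version: $version"
}
```

The JSONPath filter can still match several entries, printed space-separated, so the `awk` step keeps only the first one and `$version` never holds a list.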


echo "OpenShift version: $version"
if [[ "$command" = "packets" ]]; then
compare_versions "$version" 4.16.0
Member


do you know if compare_versions will work with funky versions like nightly / ci and others?
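
As a hedged illustration of the concern, one way to make a comparison tolerant of nightly or CI builds is to drop everything after the first `-` or `+` before comparing. The functions below are hypothetical and do not describe how the repository's `compare_versions` behaves; the sample version strings are examples only, and `sort -V` is assumed to be available.

```bash
# Strip any pre-release or build suffix from a version string,
# e.g. 4.20.0-0.nightly-2025-09-01-123456 becomes 4.20.0.
base_version() {
  printf '%s\n' "${1%%[-+]*}"
}

# Succeed when the base of version $1 is greater than or equal to $2.
# Relies on `sort -V`; the lower of the two versions is printed first.
version_at_least() {
  local current
  current="$(base_version "$1")"
  [ "$(printf '%s\n%s\n' "$current" "$2" | sort -V | head -n1)" = "$2" ]
}

# Example:
#   version_at_least 4.20.0-0.nightly-2025-09-01-123456 4.16.0 && echo supported
```

Dropping the suffix treats a pre-release build as equal to its final release, which is a deliberate simplification for a feature gate like this one.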

@jpinsonneau
Member Author

/retest

@codecov

codecov bot commented Sep 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.20%. Comparing base (00ae67e) to head (651f944).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #373   +/-   ##
=======================================
  Coverage   17.20%   17.20%           
=======================================
  Files          15       15           
  Lines        2133     2133           
=======================================
  Hits          367      367           
  Misses       1740     1740           
  Partials       26       26           
Flag Coverage Δ
unittests 17.20% <ø> (ø)

@jotak
Member

jotak commented Sep 5, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Sep 5, 2025
@jotak jotak removed the needs-review label Sep 5, 2025
@memodi
Member

memodi commented Sep 12, 2025

/ok-to-test

@github-actions

New image:
quay.io/netobserv/network-observability-cli:2f285a0

It will expire after two weeks.

To use this build, update your commands using:

USER=netobserv VERSION=2f285a0 make commands

or download the updated commands.

@memodi
Member

memodi commented Sep 12, 2025

/label qe-approved

@memodi
Member

memodi commented Sep 12, 2025

/approve

@jpinsonneau
Member Author

@memodi
Member

memodi commented Sep 15, 2025

/retest

@memodi
Member

memodi commented Sep 15, 2025

> @memodi any clue on why integration tests failed ?
>
> https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/netobserv_network-observability-cli/373/pull-ci-netobserv-network-observability-cli-main-integration-tests/1966548512875220992

looks like an infra issue where cluster didn't come up. I see it was passed previously. Let's see if re-test comes better.

@memodi
Member

memodi commented Sep 16, 2025

> looks like an infra issue where cluster didn't come up. I see it was passed previously. Let's see if re-test comes better.

it passed on re-run, we're good to merge here.

@jpinsonneau
Member Author

> > looks like an infra issue where cluster didn't come up. I see it was passed previously. Let's see if re-test comes better.
>
> it passed on re-run, we're good to merge here.

Awesome, thanks for the confirmation

@openshift-ci

openshift-ci bot commented Sep 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: memodi

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 9fb7ab2 into netobserv:main Sep 17, 2025
14 checks passed

3 participants