
Added gathering script for SNOs with workload partitioning #373

Merged

merged 1 commit into openshift:master on Aug 14, 2023

Conversation

Contributor

@rbaturov rbaturov commented Aug 2, 2023

This PR adds support for collecting the data needed to validate SNOs with workload partitioning enabled.

The script collects:

  1. systemctl list-unit-files and systemctl list-units, needed to check whether certain services are active or inactive.
  2. The CRI-O configuration directory, to verify that workload partitioning is properly configured.

The script collects these files only after checking the workload-partitioning flags that indicate the SNO has workload partitioning enabled.

https://issues.redhat.com/browse/TELCOSTRAT-156
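The gate-then-collect flow described above can be sketched as follows. This is an illustration, not the PR's actual code: the marker path, function name, and output layout are assumptions.

```shell
#!/bin/bash
# Sketch of the described flow: collect systemd unit state and the CRI-O
# config directory only when a workload-partitioning marker is present.
# The marker path and output layout are assumptions, not the PR's code.
collect_if_partitioned() {
  local host_root="$1" out_dir="$2"
  # Assumed marker: a CRI-O drop-in created when workload partitioning is on.
  if [ ! -e "${host_root}/etc/crio/crio.conf.d/01-workload-partitioning" ]; then
    echo "workload partitioning not enabled; skipping collection"
    return 0
  fi
  mkdir -p "${out_dir}"
  systemctl list-unit-files > "${out_dir}/systemctl_list-unit-files" 2>/dev/null || true
  systemctl list-units > "${out_dir}/systemctl_list-units" 2>/dev/null || true
  cp -r "${host_root}/etc/crio" "${out_dir}/crio" 2>/dev/null || true
}
```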

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 2, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2023

Hi @rbaturov. Thanks for your PR.

I'm waiting for an OpenShift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@RickJWagner RickJWagner left a comment


Looks good. Thank you for the informative opening comment (including link to workload partitioning, etc.)

@davemulford

Echoing Rick's words about an informative opening comment. That was very helpful for me in reviewing this PR.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davemulford, rbaturov

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2023
Contributor

@sferich888 sferich888 left a comment


I think there are a few items we want to follow up on before we push for testing on this.

@@ -117,5 +117,8 @@ oc adm inspect --dest-dir must-gather --rotated-pod-logs "${all_ns_resources[@]}
# Gather Performance profile information
/usr/bin/gather_ppc

# Gather SNO resources
/usr/bin/gather_sno
Contributor


Do we want to run this all the time (even on non-SNO clusters)?

Contributor Author


I thought workload partitioning was an SNO-only feature, but now I understand it is being introduced for multi-node clusters in OCP 4.13, so maybe the "gather_sno" naming should be changed.
In any case, we want to run this script for each node on a cluster that has workload partitioning enabled.


local DEBUG_POD_NAME_PREFIX=""

#Start Debug pod force it to stay up until removed in "default" namespace
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not start a debug pod in the default namespace; must-gather already runs in its own namespace (and has admin privileges).

Contributor Author


I was following this template when writing my script:
https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_core_dumps
Could you please guide me on how to do this properly?
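One possible pattern (a sketch, not a prescription): `oc debug node/<node>` creates its pod in the current project rather than `default`, and tears the pod down when the command exits, which avoids both the manual namespace choice and the manual cleanup. The function name and output layout below are assumptions.

```shell
# Sketch: run a one-shot command on the host via `oc debug`, which creates
# the debug pod in the current namespace and removes it on exit. The debug
# pod mounts the host filesystem at /host, so we chroot into it.
gather_unit_files() {
  local node="$1" out_dir="$2"
  mkdir -p "${out_dir}"
  oc debug node/"${node}" -- chroot /host systemctl list-unit-files \
    > "${out_dir}/systemctl_list-unit-files" 2>/dev/null
}
```

When must-gather runs in-cluster, the current project is the temporary must-gather namespace, so no explicit cleanup step is needed.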

DEBUG_POD_NAME_PREFIX=${1//./}

#Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
sleep 2
Contributor


Can't you wait for a condition (instead of a sleep)?

https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait see --for condition=available

Contributor


You do this below; why is this sleep needed?

Contributor Author


Without the sleep, the pod does not register in time and the 'oc wait' command fails with an error.
This is what happens if we omit the "sleep 2" command:

[rbaturov@rbaturov must-gather]$ source collection-scripts/gather_sno
INFO: Waiting for SNO system data collection to complete ...
error: arguments in resource/name form must have a single resource and name
Debug pod for node sno2.r207-sno2.r207.lab.eng.cert.redhat.com never activated
[1]+ Done get_system_data_off_sno "${NODE}"
INFO: SNO system data collection complete.


#Collect /etc/crio directory
echo "Collecting /etc/crio"
oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1
Contributor


@control-d do we already collect this (or similar data) in other ways? If so, do we need to collect it again here?

oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1

#clean up debug pod after we are done using them
oc delete pod "$debugPod" -n "default"
Contributor


This isn't needed if the pod is started in the must-gather namespace.


mkdir -p "${NODES_PATH}"/

for NODE in ${NODES}; do
Contributor


How do we limit the nodes that are supplied here (e.g. to the control plane)? What happens if we have 500 nodes? How long will this collection take?

And if we do limit this, do we still need to collect it by default?
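If the loop does need a bound, one minimal sketch is to cap the node list before fanning out. The cap value, helper name, and label-selector suggestion are assumptions, not part of this PR:

```shell
# Sketch: cap how many nodes the collection loop visits so runtime stays
# bounded on large clusters. The cap value is an assumption.
limit_nodes() {
  local max="$1"
  shift
  local out="" count=0 node
  for node in "$@"; do
    count=$((count + 1))
    if [ "${count}" -gt "${max}" ]; then
      break
    fi
    out="${out}${node} "
  done
  printf '%s' "${out% }"
}
# e.g. NODES="$(limit_nodes 10 ${NODES})" before the collection loop, or
# select only control-plane nodes up front with the standard role label:
#   oc get nodes -l node-role.kubernetes.io/master= -o name
```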

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 27652e4 and 2 for PR HEAD 48a957a in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

@rbaturov: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit 7595b36 into openshift:master Aug 14, 2023
3 checks passed
@ingvagabund
Member

@davemulford looks like your lgtm accidentally added the approve label as well. Also, I don't see a revert button here that would let us quickly open a revert PR to undo the changes and resume the review.

@rbaturov given that you authored the PR, it would be better if you revert it by hand and re-introduce the changes in a new PR so the review can resume.

@sferich888
Contributor

/revert
