
Added gathering script for SNOs with workload partitioning #373

Merged

merged 1 commit into openshift:master on Aug 14, 2023

Conversation

Contributor

@rbaturov rbaturov commented Aug 2, 2023

This PR adds support for collecting the data needed to validate SNOs with workload partitioning enabled.

The script collects:

  1. systemctl list-unit-files and systemctl list-units, needed to check whether certain services are active or inactive.
  2. The CRI-O configuration directory, to verify that workload partitioning is properly configured.

The script collects these files only after checking the workload-partitioning flags that indicate the SNO has workload partitioning enabled.

https://issues.redhat.com/browse/TELCOSTRAT-156
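The gate-then-collect flow described above can be sketched as follows. This is an illustration, not the PR's actual code: the marker path, function name, and output layout are assumptions.

```shell
#!/bin/bash
# Sketch of the described flow: collect systemd unit state and the CRI-O
# config directory only when a workload-partitioning marker is present.
# The marker path and output layout are assumptions, not the PR's code.
collect_if_partitioned() {
  local host_root="$1" out_dir="$2"
  # Assumed marker: a CRI-O drop-in created when workload partitioning is on.
  if [ ! -e "${host_root}/etc/crio/crio.conf.d/01-workload-partitioning" ]; then
    echo "workload partitioning not enabled; skipping collection"
    return 0
  fi
  mkdir -p "${out_dir}"
  systemctl list-unit-files > "${out_dir}/systemctl_list-unit-files" 2>/dev/null || true
  systemctl list-units > "${out_dir}/systemctl_list-units" 2>/dev/null || true
  cp -r "${host_root}/etc/crio" "${out_dir}/crio" 2>/dev/null || true
}
```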

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 2, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2023

Hi @rbaturov. Thanks for your PR.

I'm waiting for an OpenShift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@RickJWagner RickJWagner left a comment


Looks good. Thank you for the informative opening comment (including link to workload partitioning, etc.)

@davemulford

Echoing Rick's words about an informative opening comment. That was very helpful for me in reviewing this PR.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davemulford, rbaturov

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2023
Contributor

@sferich888 sferich888 left a comment


I think there are a few items we want to follow up on before we push for testing on this.

@@ -117,5 +117,8 @@ oc adm inspect --dest-dir must-gather --rotated-pod-logs "${all_ns_resources[@]}
# Gather Performance profile information
/usr/bin/gather_ppc

# Gather SNO resources
/usr/bin/gather_sno
Contributor


Do we want to run this all the time (even on non-SNO clusters)?

Contributor Author


I thought workload partitioning was an SNO-only feature, but now I understand it is being introduced for multi-node clusters in OCP 4.13, so maybe the "gather_sno" naming should be changed.
In any case, we want to run this script for each node on a cluster that has workload partitioning enabled.


local DEBUG_POD_NAME_PREFIX=""

#Start Debug pod force it to stay up until removed in "default" namespace
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not start a debug pod in the default namespace; must-gather already runs in its own namespace (and has admin privileges).

Contributor Author


I was following this template when writing my script:
https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_core_dumps
Could you please guide me on how to do this properly?
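One possible pattern (a sketch, not a prescription): `oc debug node/<node>` creates its pod in the current project rather than `default`, and tears the pod down when the command exits, which avoids both the manual namespace choice and the manual cleanup. The function name and output layout below are assumptions.

```shell
# Sketch: run a one-shot command on the host via `oc debug`, which creates
# the debug pod in the current namespace and removes it on exit. The debug
# pod mounts the host filesystem at /host, so we chroot into it.
gather_unit_files() {
  local node="$1" out_dir="$2"
  mkdir -p "${out_dir}"
  oc debug node/"${node}" -- chroot /host systemctl list-unit-files \
    > "${out_dir}/systemctl_list-unit-files" 2>/dev/null
}
```

When must-gather runs in-cluster, the current project is the temporary must-gather namespace, so no explicit cleanup step is needed.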

DEBUG_POD_NAME_PREFIX=${1//./}

#Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
sleep 2
Contributor


Can't you wait for a condition (instead of a sleep)?

https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait see --for condition=available

Contributor


You do this below; why is this sleep needed?

Contributor Author


Without the sleep, the pod does not register in time and the 'oc wait' command fails with an error.
This is what happens if we omit the "sleep 2" command:

[rbaturov@rbaturov must-gather]$ source collection-scripts/gather_sno
INFO: Waiting for SNO system data collection to complete ...
error: arguments in resource/name form must have a single resource and name
Debug pod for node sno2.r207-sno2.r207.lab.eng.cert.redhat.com never activated
[1]+ Done get_system_data_off_sno "${NODE}"
INFO: SNO system data collection complete.


#Collect /etc/crio directory
echo "Collecting /etc/crio"
oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1
Contributor


@control-d do we already collect this (or similar data) in other ways? If so, do we need to collect it again here?

oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1

#clean up debug pod after we are done using them
oc delete pod "$debugPod" -n "default"
Contributor


This isn't needed if the pod is started in the must-gather namespace.


mkdir -p "${NODES_PATH}"/

for NODE in ${NODES}; do
Contributor


How do we limit the nodes that are supplied here (e.g. to the control plane)? What happens if we have 500 nodes? How long will this collection take?

And if we do limit this, do we still need to collect it by default?
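If the loop does need a bound, one minimal sketch is to cap the node list before fanning out. The cap value, helper name, and label-selector suggestion are assumptions, not part of this PR:

```shell
# Sketch: cap how many nodes the collection loop visits so runtime stays
# bounded on large clusters. The cap value is an assumption.
limit_nodes() {
  local max="$1"
  shift
  local out="" count=0 node
  for node in "$@"; do
    count=$((count + 1))
    if [ "${count}" -gt "${max}" ]; then
      break
    fi
    out="${out}${node} "
  done
  printf '%s' "${out% }"
}
# e.g. NODES="$(limit_nodes 10 ${NODES})" before the collection loop, or
# select only control-plane nodes up front with the standard role label:
#   oc get nodes -l node-role.kubernetes.io/master= -o name
```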

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 27652e4 and 2 for PR HEAD 48a957a in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2023

@rbaturov: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit 7595b36 into openshift:master Aug 14, 2023
3 checks passed
@ingvagabund
Member

@davemulford looks like your lgtm accidentally added the approve label as well. Also, I don't see a revert button here that would let us quickly open a revert PR to undo the changes and resume the review.

@rbaturov given that you authored the PR, it would be better if you revert it by hand and re-introduce the changes in a new PR so the review can resume.

@sferich888
Contributor

/revert
