New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added gathering script for SNOs with workload partitioning #373
Added gathering script for SNOs with workload partitioning #373
Conversation
Hi @rbaturov. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you for the informative opening comment (including link to workload partitioning, etc.)
Echoing Rick's words about an informative opening comment. That was very helpful for me in reviewing this PR. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: davemulford, rbaturov The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a few items we want to followup on before we push for testing on this.
@@ -117,5 +117,8 @@ oc adm inspect --dest-dir must-gather --rotated-pod-logs "${all_ns_resources[@]} | |||
# Gather Performance profile information | |||
/usr/bin/gather_ppc | |||
|
|||
# Gather SNO resources | |||
/usr/bin/gather_sno |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to run this all the time (even on non SNO clusters)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that workload partitioning is an SNO's only feature. but now I understand it being introduced for multi-node clusters from OCP 4.13. so maybe the "gather_sno" naming should be changed.
Anyhow, we want to run this script for each node, on a cluster that has workload partitioning enabled.
|
||
local DEBUG_POD_NAME_PREFIX="" | ||
|
||
#Start Debug pod force it to stay up until removed in "default" namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not start a debug pod in the default namespace; must-gather already runs in its own namespace (and has admin privileges).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following this template when writing my rule:
https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_core_dumps
Could you please guide me then how to do this properly?
DEBUG_POD_NAME_PREFIX=${1//./} | ||
|
||
#Mimic a normal oc call, i.e pause between two successive calls to allow pod to register | ||
sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use a wait for a condition (vs a sleep)?
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait see --for condition=available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do this below; why is this sleep needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there was no sleep then the pod will not register on time, and the 'oc wait' command will resolve with an error.
This what will happen if we omit the "sleep 2" command:
[rbaturov@rbaturov must-gather]$ source collection-scripts/gather_sno INFO: Waiting for SNO system data collection to complete ... error: arguments in resource/name form must have a single resource and name Debug pod for node sno2.r207-sno2.r207.lab.eng.cert.redhat.com never activated [1]+ Done get_system_data_off_sno "${NODE}" INFO: SNO system data collection complete. [rbaturov@rbaturov must-gather]$
|
||
#Collect /etc/crio directory | ||
echo "Collecting /etc/crio" | ||
oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@control-d do we collect this (or similar data in other ways)? Do we need to collect this again (if so)?
oc cp -n "default" "$debugPod":/host/etc/crio "$NODES_PATH"/"$1"/crio > /dev/null 2>&1 | ||
|
||
#clean up debug pod after we are done using them | ||
oc delete pod "$debugPod" -n "default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed if the pod; is started in the must-gather namespace.
|
||
mkdir -p "${NODES_PATH}"/ | ||
|
||
for NODE in ${NODES}; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we limit the nodes that are supplied here (to the control plane)? What happens if we have 500 nodes? How long will this collection take?
If we limit this (do we need to collect this by default)?
@rbaturov: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@davemulford looks like your lgtm accidentally added approve label as well. Also, I don't see any @rbaturov given you authored the PR, it will be better if you revert the PR by hand and re-introduce the changes in a new PR so the review can resume. |
/revert |
This PR adds support for collecting data needed for validating SNO's with workload partitioning enabled.
The scripts collects:
The script starts to collect these files only after checking workload partitioning flags indicating the SNO has workload partitioning enabled.
https://issues.redhat.com/browse/TELCOSTRAT-156