add must-gather command #22430
Conversation
Force-pushed 14351d3 to 613c04f
@sanchezl add a phase-in plan to the description, please.
soltysh left a comment:
Left you some initial comments...
mustGatherExample = templates.Examples(`
	# gather default information using the default image and command, writing into ./must-gather.local.<rand>
	oc adm must-gather
We try not to embed the name in the examples; use %[1]s instead and then
Example: fmt.Sprintf(mustGatherExample, fullName)
@soltysh I don't think we're gaining much with that, do you? Clayton came to a similar conclusion.
There was a problem hiding this comment.
I know tools that are consuming us, and just trying to be good citizen and allow that consumption. Also I'm slowly accepting the fact that both kubectl and oc are primitives that folks use to build their opinionated flows on top (see https://www.youtube.com/watch?v=ytu3aUCwlSg for example how AirBnB is wrapping kubectl just for that). But that's not a strong requirement, yet it doesn't cost us that much to have it done this way.
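For context, a minimal sketch of the substitution pattern being discussed; the example string and fullName value are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// Hypothetical example string using %[1]s as the placeholder for the
// binary name, as suggested in the review comments above.
var mustGatherExample = `
  # gather default information using the default image and command
  %[1]s adm must-gather`

func main() {
	fullName := "oc" // a wrapping CLI could substitute its own name here
	fmt.Println(fmt.Sprintf(mustGatherExample, fullName))
}
```

A tool that vendors these commands can then render the examples under its own binary name instead of a hard-coded "oc".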
`)
type MustGatherFlags struct {
	ConfigFlags *genericclioptions.ConfigFlags
You should not need this; since you're part of oc, this is implicit.

No longer applicable.
}
func NewMustGatherCommand(restClientGetter genericclioptions.RESTClientGetter, streams genericclioptions.IOStreams) *cobra.Command {
	f := NewMustGatherFlags(streams)
Nit: o is the preferred var name for option structs; this applies to the entire file.
Long:    mustGatherLong,
Example: mustGatherExample,
Run: func(cmd *cobra.Command, args []string) {
	kcmdutil.CheckErr(f.Complete(cmd, restClientGetter, args).RunMustGather())
kcmdutil.CheckErr(o.Complete(f, cmd, args))
kcmdutil.CheckErr(o.Run())
is preferred.
}
cmd.Flags().StringVar(&f.NodeName, "node-name", f.NodeName, "Set a specific node to use - by default a random master will be used")
cmd.Flags().StringVar(&f.Image, "image", f.Image, "Set a specific to use - by default the image will be looked up for OpenShift's must-gather")
"Set a specific image to use; by default OpenShift's must-gather image will be used."
}
func (f *MustGatherFlags) Complete(cmd *cobra.Command, restClientGetter genericclioptions.RESTClientGetter, args []string) *MustGatherOptions {
	o := &MustGatherOptions{
Something smells here. You should have only one MustGatherOptions struct and work from there through Complete -> Validate (if needed) -> Run. That is the required flow for all commands.
	return nil
}
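A minimal sketch of the single-options-struct flow described above, assuming the usual cobra/kcmdutil imports; the fields and empty method bodies are placeholders, not the PR's actual code:

```go
package admin

import (
	"github.com/spf13/cobra"
	"k8s.io/cli-runtime/pkg/genericclioptions"
	kcmdutil "k8s.io/kubernetes/pkg/kubectl/cmd/util"
)

// MustGatherOptions is the single options struct the review asks for;
// Complete -> Validate -> Run operate on it in sequence.
type MustGatherOptions struct {
	NodeName string
	Image    string
	genericclioptions.IOStreams
}

func NewMustGatherCommand(streams genericclioptions.IOStreams) *cobra.Command {
	o := &MustGatherOptions{IOStreams: streams}
	cmd := &cobra.Command{
		Use: "must-gather",
		Run: func(cmd *cobra.Command, args []string) {
			kcmdutil.CheckErr(o.Complete(cmd, args))
			kcmdutil.CheckErr(o.Validate())
			kcmdutil.CheckErr(o.Run())
		},
	}
	cmd.Flags().StringVar(&o.NodeName, "node-name", o.NodeName, "Set a specific node to use - by default a random master will be used")
	return cmd
}

// Placeholder bodies; the real command fills these in.
func (o *MustGatherOptions) Complete(cmd *cobra.Command, args []string) error { return nil }
func (o *MustGatherOptions) Validate() error                                  { return nil }
func (o *MustGatherOptions) Run() error                                       { return nil }
```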
func (o *MustGatherOptions) newPod(node, ns string) *corev1.Pod {
Let's get this into a Jira story. Add deleting the namespace. Mark the command hidden and deprecated and see about starting to phase this in with @soltysh and reasonable tests on subsets. Reorder the pod creation and the image lookup. Use a floating tag on quay to start this. Describe a way to recognize from the client that the command has finished.
If we didn't do this, could we use a cluster-reader + secret/metrics/debug/health reader assigned to a service account with an audience?
Hidden 👍, but why deprecated? Better: experimental. Let's make it clear it's a moving target.
sure
Updated openshift/must-gather#65 as it will be needed in tandem with this command.
Force-pushed b7b6398 to 97f4ee5
if len(o.Image) == 0 {
	// TODO lookup cluster specific default
	// Image: "quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.0.5-1-ose-must-gather",
	o.Image = "sanchezl/ocp-v4.0-art-dev:v4.0.5-1-ose-must-gather"
quay.io/openshift/origin-must-gather:latest I'd think
// TODO This command will:
// [x] Create a temporary namespace
// [x] Create a secret and upload the local kubeconfig content into it
let's not do this and grant an aggregated cluster role to the service account in this namespace

Simplest possible thing to start. Grant it cluster-admin to land.
Went with cluster-admin for now. Opened #22430 as a reminder to fix in the future.
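A minimal sketch of that cluster-admin grant, assuming rbacv1 is k8s.io/api/rbac/v1 and metav1 is k8s.io/apimachinery/pkg/apis/meta/v1; the binding and service-account names are illustrative assumptions, not the PR's actual code:

```go
package admin

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newClusterRoleBinding binds cluster-admin to the temporary namespace's
// default service account so the gather pod can read everything.
func newClusterRoleBinding(ns string) *rbacv1.ClusterRoleBinding {
	return &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "must-gather-"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "cluster-admin",
		},
		Subjects: []rbacv1.Subject{
			{Kind: "ServiceAccount", Name: "default", Namespace: ns},
		},
	}
}
```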
fmt.Fprintf(o.Out, "Created ns/%s\n", ns.Name)
kubeConfigSecret, err := o.newKubeConfigSecret(ns.Name)
Command: []string{"openshift-must-gather", "inspect", "clusteroperator"},
Env: []corev1.EnvVar{
	{
		Name: "KUBECONFIG",
VolumeMounts: []corev1.VolumeMount{
	{
		Name:      "kubeconfig",
		MountPath: "/etc/kubernetes",
Ha, this is clever. Was this me or you?
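Putting the two fragments above together, a sketch of the container wiring, assuming corev1 is k8s.io/api/core/v1; the container name and the "kubeconfig" file name inside the mount are assumptions, only the quoted fields come from the PR:

```go
package admin

import corev1 "k8s.io/api/core/v1"

// gatherContainer shows the trick being praised above: the kubeconfig
// secret is mounted at /etc/kubernetes and KUBECONFIG points the gather
// binary at it, so the pod talks back to the same cluster with the
// caller's credentials.
func gatherContainer(image string) corev1.Container {
	return corev1.Container{
		Name:    "gather",
		Image:   image,
		Command: []string{"openshift-must-gather", "inspect", "clusteroperator"},
		Env: []corev1.EnvVar{
			{Name: "KUBECONFIG", Value: "/etc/kubernetes/kubeconfig"},
		},
		VolumeMounts: []corev1.VolumeMount{
			{Name: "kubeconfig", MountPath: "/etc/kubernetes"},
		},
	}
}
```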
@sanchezl take this out of WIP and add a test-cmd test for it.
Force-pushed 41b7d1d to 87bd873
What happens when launching pods won't work? How do I must-gather then?
must-gather can be run as a CLI plug-in (or standalone binary) too. The benefit of an image is that it lets us ship the binary as part of a release (connected but independent); with oc we don't have the same mechanism (which is why a plug-in was not considered). In short, this is a shortcoming of this implementation (using an image and the oc shim to start the image and exfiltrate the data); however, if your cluster is in that bad a shape, you're likely going to need host-level access to debug issues that are probably not cluster related.
/test verify
Force-pushed f5b2970 to 113afe2
soltysh left a comment:
A few more fresh comments + some old ones are still not addressed.
if err != nil {
	return err
}
if podPhase != corev1.PodRunning {
This is still not addressed.
}
if len(o.Image) == 0 {
	// TODO lookup cluster specific default
	o.Image = "quay.io/openshift/origin-must-gather:v4.0"
Set the default in NewMustGatherOptions; this way the default will be nicely presented in the command's help.
This isn't the real default, see #22528.
verify failure is real again because you're relying on a new package. Whitelist all of k8s.io/apimachinery to save trouble.
Force-pushed 113afe2 to dc0df0f
Force-pushed 5a7b7af to 9d89771
/retest
Force-pushed a2b8706 to da8f59d
/retest
Force-pushed da8f59d to 5089636
if err != nil {
	return err
}
if phase != corev1.PodRunning {
The switch from Pending to Running in most cases should be fast, but why wait for the pod to leave Pending and then check its status after the wait, rather than just waiting for Running in wait.PollImmediate?

It might go from Pending to Failed (or even Succeeded, displayed as Completed). With RestartPolicy: Never, I would be waiting for a transition that might never happen if I looked for Running specifically. This way we fail ASAP, instead of timing out the wait.PollImmediate call.

Right, but the transition from Pending to Running might not be as fast as this check, and you will fail then, b/c the pod is still in Pending. Maybe you should have more explicit checks in place then.

My bad, I misread the condition in wait.PollImmediate, it's 👍
err := wait.PollImmediate(time.Second, 10*time.Minute, func() (bool, error) {
	var err error
	if pod, err = o.Client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{}); err != nil {
		klog.Error(err)
Don't log the error here; if you care, log it at higher debug levels, since it's not an error per se.
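A sketch of the polling logic both threads converge on, assuming pod and o.Client are in scope as in the hunk above; the error message text is an assumption:

```go
// Waiting for the pod to leave Pending (rather than waiting for Running)
// means a pod that jumps straight to Failed or Succeeded under
// RestartPolicy: Never fails fast instead of timing out the poll.
err := wait.PollImmediate(time.Second, 10*time.Minute, func() (bool, error) {
	var err error
	if pod, err = o.Client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{}); err != nil {
		klog.V(4).Info(err) // transient lookup failure; keep polling
		return false, nil
	}
	return pod.Status.Phase != corev1.PodPending, nil
})
if err != nil {
	return err
}
if pod.Status.Phase != corev1.PodRunning {
	return fmt.Errorf("pod entered phase %q instead of running", pod.Status.Phase)
}
```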
I see that fake test. Start an e2e one. /lgtm
This is how badly I want this in a beta.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
ci-operator failure
/retest
/test all
/test e2e-aws
/retest
/retest

Please review the full test history for this PR and help us cut down flakes.
@sanchezl: The following test failed, say /retest to rerun it.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Picking up from #22405
https://jira.coreos.com/browse/MSTR-351