add must-gather command #22430

Merged
deads2k merged 1 commit into openshift:master from sanchezl:oc_adm_must-gather
Apr 15, 2019

Conversation

@sanchezl
Contributor

@sanchezl sanchezl commented Mar 28, 2019

Launch a pod to gather debugging information 

This command will launch a pod in a temporary namespace on your cluster that gathers debugging
information, using a copy of the active client config context, and then downloads the gathered
information.

Usage:
  oc adm must-gather [flags]

Examples:
  # gather default information using the default image and command, writing into ./must-gather.local.<rand>
  oc adm must-gather
  
  # gather default information with a specific local folder to copy to
  oc adm must-gather --dest-dir=/local/directory
  
  # gather default information using a specific image, command, and pod-dir
  oc adm must-gather --image=my/image:tag --source-dir=/pod/directory -- myspecial-command.sh

Options:
      --dest-dir='': Set a specific directory on the local machine to write gathered data to.
      --image='': Set a specific image to use - by default the image will be looked up for OpenShift's must-gather
      --node-name='': Set a specific node to use - by default a random master will be used

Picking up from #22405

  • Create a temporary namespace
  • Create a cluster role binding for the default service account for the temporary namespace.
  • Create a pod that runs the 'gather' command in that image.
  • Wait for completion of that command.
  • Look up the image to use for gathering information
  • Copy the gathered data locally
  • Delete the temporary namespace.
  • Delete the cluster role binding.

https://jira.coreos.com/browse/MSTR-351

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress and size/L labels Mar 28, 2019
@openshift-ci-robot openshift-ci-robot requested a review from enj March 28, 2019 19:07
@sanchezl sanchezl force-pushed the oc_adm_must-gather branch from 14351d3 to 613c04f Compare March 28, 2019 19:58
@deads2k
Contributor

deads2k commented Mar 29, 2019

@sanchezl please add a phase-in plan to the description.

@soltysh soltysh self-assigned this Mar 29, 2019
Contributor

@soltysh soltysh left a comment

Left you some initial comments....


mustGatherExample = templates.Examples(`
# gather default information using the default image and command, writing into ./must-gather.local.<rand>
oc adm must-gather
Contributor

We try not to embed the name in the examples, use %[1]s instead and then
Example: fmt.Sprintf(mustGatherExample, fullName)

Contributor

We try not to embed the name in the examples, use %[1]s instead and then
Example: fmt.Sprintf(mustGatherExample, fullName)

@soltysh I don't think we're gaining much with that, do you? Clayton came to a similar conclusion.

Contributor

I know of tools that consume us, and I'm just trying to be a good citizen and allow that consumption. I'm also slowly accepting the fact that both kubectl and oc are primitives that folks use to build their own opinionated flows on top of (see https://www.youtube.com/watch?v=ytu3aUCwlSg for an example of how Airbnb wraps kubectl for just that). It's not a strong requirement, but it doesn't cost us much to do it this way.
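
For reference, the `%[1]s` suggestion in this thread could look roughly like the sketch below. It uses only `fmt`; `renderExample` is a hypothetical helper, and the example text is shortened from the PR description.

```go
package main

import "fmt"

// mustGatherExample uses %[1]s wherever the command name appears, so the
// same text still reads correctly when a wrapper embeds the command under
// a different name.
const mustGatherExample = `  # gather default information using the default image and command
  %[1]s

  # gather default information with a specific local folder to copy to
  %[1]s --dest-dir=/local/directory`

// renderExample (hypothetical) fills in the full command name, as a caller
// would when setting the Example field of the command.
func renderExample(fullName string) string {
	return fmt.Sprintf(mustGatherExample, fullName)
}

func main() {
	fmt.Println(renderExample("oc adm must-gather"))
}
```

The explicit argument index `[1]` lets one argument fill every occurrence, so the name is passed only once.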

)

type MustGatherFlags struct {
	ConfigFlags *genericclioptions.ConfigFlags
Contributor

You should not need this, since you're part of oc this is implicit.

Contributor Author

No longer applicable.

}

func NewMustGatherCommand(restClientGetter genericclioptions.RESTClientGetter, streams genericclioptions.IOStreams) *cobra.Command {
	f := NewMustGatherFlags(streams)
Contributor

Nit: o is the preferred var name for option structs; this applies to the entire file.

Contributor Author

Done

Long: mustGatherLong,
Example: mustGatherExample,
Run: func(cmd *cobra.Command, args []string) {
	kcmdutil.CheckErr(f.Complete(cmd, restClientGetter, args).RunMustGather())
Contributor

kcmdutil.CheckErr(o.Complete(f, cmd, args))
kcmdutil.CheckErr(o.Run())

is preferred.

Contributor Author

Done

}

cmd.Flags().StringVar(&f.NodeName, "node-name", f.NodeName, "Set a specific node to use - by default a random master will be used")
cmd.Flags().StringVar(&f.Image, "image", f.Image, "Set a specific to use - by default the image will be looked up for OpenShift's must-gather")
Contributor

Set a specific image to use, by default the OpenShift's must-gather image will be used.

Contributor Author

Done

}

func (f *MustGatherFlags) Complete(cmd *cobra.Command, restClientGetter genericclioptions.RESTClientGetter, args []string) *MustGatherOptions {
	o := &MustGatherOptions{
Contributor

Something smells here. You should have only one MustGatherOptions struct and work from there through Complete -> Validate (if needed) -> Run. That is the required flow for all commands.

Contributor Author

Done
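
A minimal sketch of the Complete -> Validate -> Run flow requested above, using only the standard library so it runs on its own. The field names follow the flags in the description; `checkErr` is a hypothetical stand-in for `kcmdutil.CheckErr`, the method signatures are simplified (no cobra command or factory), and the validation rule is an assumption for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// MustGatherOptions is the single options struct; flags bind to its fields.
type MustGatherOptions struct {
	NodeName string
	Image    string
	DestDir  string
	Command  []string
}

// Complete fills in whatever can be derived from the arguments or defaults.
func (o *MustGatherOptions) Complete(args []string) error {
	o.Command = args
	if o.DestDir == "" {
		// Placeholder only; the description says a random suffix is appended.
		o.DestDir = "must-gather.local"
	}
	return nil
}

// Validate rejects options that cannot work (illustrative rule only).
func (o *MustGatherOptions) Validate() error {
	if o.Image == "" {
		return errors.New("an image must be specified")
	}
	return nil
}

// Run would do the actual work; here it only reports what it would do.
func (o *MustGatherOptions) Run() error {
	fmt.Printf("gathering with image %q into %q\n", o.Image, o.DestDir)
	return nil
}

// checkErr prints the error and exits non-zero, like a CheckErr helper.
func checkErr(err error) {
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

func main() {
	o := &MustGatherOptions{Image: "quay.io/openshift/origin-must-gather:latest"}
	checkErr(o.Complete(nil))
	checkErr(o.Validate())
	checkErr(o.Run())
}
```

Keeping each phase as a separate call makes it easy to test `Complete` and `Validate` without a cluster.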

return nil
}

func (o *MustGatherOptions) newPod(node, ns string) *corev1.Pod {
Contributor

ns is not being used

Contributor Author

It's used now.

@deads2k
Contributor

deads2k commented Mar 29, 2019

Picking up from #22405

  • Create a temporary namespace
  • Create a secret and upload the local kubeconfig content into it
  • Look up the image to use for gathering information
  • Create a pod that runs the 'gather' command in that image
  • Wait for completion of that command
  • Copy the gathered data locally

Let's get this into a Jira story. Add deleting the namespace. Mark the command hidden and deprecated and see about starting to phase this in with @soltysh and reasonable tests on subsets.

Reorder the pod creation and the image lookup. Use a floating tag on quay to start this. Describe a way to recognize from the client that the command has finished.

@deads2k
Contributor

deads2k commented Mar 29, 2019

  • Create a secret and upload the local kubeconfig content into it

If we didn't do this, could we use a cluster-reader + secret/metrics/debug/health reader assigned to a service account with an audience?

@enj

@soltysh
Contributor

soltysh commented Apr 1, 2019

Let's get this into a Jira story. Add deleting the namespace. Mark the command hidden and deprecated and see about starting to phase this in with @soltysh and reasonable tests on subsets.

Hidden 👍, but why deprecated? Experimental would be better. Let's make it clear it's a moving target.

@deads2k
Contributor

deads2k commented Apr 1, 2019

but why deprecated? Experimental would be better.

sure

@sferich888
Contributor

Updated openshift/must-gather#65 as it will be needed in tandem with this command.

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch 2 times, most recently from b7b6398 to 97f4ee5 Compare April 1, 2019 20:48
if len(o.Image) == 0 {
	// TODO lookup cluster specific default
	// Image: "quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.0.5-1-ose-must-gather",
	o.Image = "sanchezl/ocp-v4.0-art-dev:v4.0.5-1-ose-must-gather"
Contributor

quay.io/openshift/origin-must-gather:latest I'd think

Contributor Author

Done


// TODO This command will:
// [x] Create a temporary namespace
// [x] Create a secret an upload the local kubeconfig content into it
Contributor

let's not do this and grant an aggregator cluster-role to the service account in this namespace

Contributor

Simplest possible thing to start. Grant it cluster-admin to land

Contributor Author

Went with cluster-admin for now. Opened #22430 as a reminder to fix in the future.


fmt.Fprintf(o.Out, "Created ns/%s\n", ns.Name)

kubeConfigSecret, err := o.newKubeConfigSecret(ns.Name)
Contributor

let's stop doing this

Contributor Author

Done

Command: []string{"openshift-must-gather", "inspect", "clusteroperator"},
Env: []corev1.EnvVar{
	{
		Name: "KUBECONFIG",
Contributor

delete

Contributor Author

Done

VolumeMounts: []corev1.VolumeMount{
	{
		Name: "kubeconfig",
		MountPath: "/etc/kubernetes",
Contributor

ha, this is clever. Was this me or you?

Contributor Author

Me. It's gone now.

@deads2k
Contributor

deads2k commented Apr 1, 2019

@sanchezl take out of WIP, add a test-cmd test for it test/cmd/admin.sh I'd think.

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch 2 times, most recently from 41b7d1d to 87bd873 Compare April 2, 2019 18:32
@sanchezl sanchezl changed the title [WIP] add must-gather command add must-gather command Apr 2, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress label Apr 2, 2019
@smarterclayton
Contributor

What happens when launching pods doesn't work? How do I must-gather then?

@sferich888
Contributor

must-gather can also be run as a CLI plug-in (or standalone binary). The benefit of an image is that it lets us ship the binary as part of a release (connected but independent); with oc we don't have the same mechanism, which is why a plug-in was not considered.

In short, this is a shortcoming of this implementation (using an image, with oc as the shim that starts the image and exfiltrates the data). However, if your cluster is in that bad a shape, you will likely need host-level access to debug issues that are likely not cluster related.

@sanchezl
Contributor Author

sanchezl commented Apr 3, 2019

/test verify

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch 2 times, most recently from f5b2970 to 113afe2 Compare April 9, 2019 21:14
Contributor

@soltysh soltysh left a comment

A few more fresh comments + some old ones are still not addressed.

if err != nil {
	return err
}
if podPhase != corev1.PodRunning {
Contributor

This is still not addressed.

}
if len(o.Image) == 0 {
	// TODO lookup cluster specific default
	o.Image = "quay.io/openshift/origin-must-gather:v4.0"
Contributor

Set the default in NewMustGatherOptions, this way the default will be nicely presented in the command's help.

Contributor Author

This isn't the real default, see #22528.

@deads2k
Contributor

deads2k commented Apr 10, 2019

verify failure is real again because you're relying on a new package. Whitelist all of k8s.io/apimachinery to save trouble

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch from 113afe2 to dc0df0f Compare April 10, 2019 19:05
@openshift-ci-robot openshift-ci-robot added the size/L label and removed the size/XL label Apr 10, 2019
@sanchezl sanchezl force-pushed the oc_adm_must-gather branch 3 times, most recently from 5a7b7af to 9d89771 Compare April 11, 2019 05:57
@sanchezl
Contributor Author

/retest

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch 2 times, most recently from a2b8706 to da8f59d Compare April 12, 2019 00:36
@sferich888
Contributor

/retest

@sanchezl sanchezl force-pushed the oc_adm_must-gather branch from da8f59d to 5089636 Compare April 12, 2019 02:30
if err != nil {
	return err
}
if phase != corev1.PodRunning {
Contributor

The switch from Pending to Running should be fast in most cases, but why wait on Pending and then check the status after the wait, instead of just waiting for Running in wait.PollImmediate?

Contributor Author

It might go from Pending to Failed (or even to Succeeded/Completed). With RestartPolicy: Never, I would be waiting for a transition that might never happen if I looked for Running specifically. This way we fail as soon as possible instead of timing out the wait.PollImmediate call.

Contributor

Right, but the transition from Pending to Running might not be as fast as this check, and then you will fail because the pod is still Pending. Maybe you should have more explicit checks in place, then.

Contributor

My bad, I misread the condition in wait.PollImmediate, it's 👍

err := wait.PollImmediate(time.Second, 10*time.Minute, func() (bool, error) {
	var err error
	if pod, err = o.Client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{}); err != nil {
		klog.Error(err)
Contributor

Don't log the error here, if you care, log it at higher debug levels, it's not an error per se.

Contributor Author

removed in PR #22561

@deads2k
Contributor

deads2k commented Apr 12, 2019

I see that fake test. Start an e2e one

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm label Apr 12, 2019
@deads2k
Contributor

deads2k commented Apr 12, 2019

I see that fake test. Start an e2e one

/lgtm
/retest

This is how bad I want this in a beta.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k
Contributor

deads2k commented Apr 12, 2019

ci-operator failure

/retest

@deads2k
Contributor

deads2k commented Apr 13, 2019

/test all

@sanchezl
Contributor Author

/test e2e-aws

@sanchezl
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot

@sanchezl: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws 5089636 link /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k deads2k merged commit 5089636 into openshift:master Apr 15, 2019

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

7 participants