RFE: Automatically analyze gathered bootstrap logs? #2569

Open
wking opened this issue Oct 24, 2019 · 13 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@wking
Member

wking commented Oct 24, 2019

When bootstrapping fails but we got far enough in to be able to SSH into the bootstrap machine, the installer automatically gathers a log tarball. But the tarball is probably fairly intimidating to users who aren't on the install team or in a position to be frequent debuggers. And in some cases, the results are sufficiently structured that we can point out a specific problem, e.g. "CRI-O failed" (#2567) or "you had insufficient creds to pull the release image" (#901). Do we want to teach the installer how to find and highlight some of those cases? We have a fair number of them in the last 24 hours of CI:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=-e2e-&maxAge=24h&context=0&search=Bootstrap+gather+logs+captured' | jq -r '. | keys[]' | sed 's|.*-e2e-\([^/]*\)/[0-9]*$|\1|' | sort | uniq -c | sort -n
      1 aws-console-olm
      1 azure-master
      1 azure-upgrade-4.2
      1 gcp-console
      1 gcp-upgrade
      1 libvirt
      1 openstack
      1 vsphere
      1 vsphere-upi-serial-4.2
      2 aws-encryption
      2 aws-ovn-kubernetes
      2 aws-ovs-kubernetes-4.3
      2 aws-upgrade
      2 aws-upi-4.3
      2 azure
      2 cmd
      2 tools
      2 wmcb
      2 wsu
      3 gcp
      3 gcp-op
      4 aws-proxy-4.2
      4 aws-proxy-4.3
      4 aws-scaleup-rhel7
      8 aws
     12 vsphere-upi-4.3
     12 vsphere-upi-serial-4.3
     20 aws-4.3

But there is currently no way to break those down by underlying cause.

@abhinavdahiya
Contributor

I like the idea. Do we have an idea regarding:

a) how do we want to expose this to users?
b) do we have a list of initial things we should highlight using this?
c) "But there is currently no way to break those down by underlying cause." So what kind of fingerprints do you think will be useful here?

@wking
Member Author

wking commented Oct 28, 2019

a) how do we want to expose this to users?

Log entries, like in #2567.

b) do we have a list of initial things we should highlight using this?

systemd unit failures, like in #2567. Also, maybe "you had insufficient creds to pull the release image"? Or "your control plane machines never formed an etcd cluster", or "I can't even SSH into your control plane machines". I don't think we need an exhaustive set of things; we can grow this incrementally as we run into issues in CI or the wild.

c) "But there is currently no way to break those down by underlying cause." So what kind of fingerprints do you think will be useful here?

Anytime someone hits a gather and says "I dunno, installer folks, but here's your gathered tarball", I think we should ask whether there is something we could either fix so it doesn't happen again or, when that's not possible (e.g. the user provides an insufficient pull secret), log a more approachable summary of the underlying issue. The tarball alone is usually going to be sufficient for debugging, but I think we'll have fewer installer-targeted tickets if we provide summaries where we can.
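
As a rough illustration of the fingerprint idea above, a minimal sketch of a pass over a gathered tarball might look like the following Go program. The patterns, the assumption that the bundle is a plain .tar.gz of text logs, and the output format are all placeholders, not the installer's actual layout or #2567's implementation:

// analyze_bundle.go: a minimal sketch of fingerprint-based analysis of a
// gathered bootstrap log tarball.  The patterns and the assumed bundle
// layout are illustrative, not the installer's actual format.
package main

import (
	"archive/tar"
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"regexp"
)

// fingerprints map a human-readable diagnosis to a pattern that might
// appear in the gathered journal logs.  Both sides are hypothetical.
var fingerprints = map[string]*regexp.Regexp{
	"a systemd unit on the bootstrap machine failed":     regexp.MustCompile(`Failed with result 'exit-code'`),
	"insufficient credentials to pull the release image": regexp.MustCompile(`unauthorized: authentication required`),
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: analyze_bundle <log-bundle.tar.gz>")
		os.Exit(2)
	}

	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	tr := tar.NewReader(gz)

	// Walk every regular file in the bundle and report each line that matches
	// a fingerprint, so the user gets a pointer into the tarball rather than
	// having to read all of it.
	for {
		hdr, err := tr.Next()
		if err != nil {
			break // io.EOF, or a read error; either way, stop scanning
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		scanner := bufio.NewScanner(tr)
		for scanner.Scan() {
			for diagnosis, pattern := range fingerprints {
				if pattern.MatchString(scanner.Text()) {
					fmt.Printf("%s: %s\n    %s\n", hdr.Name, diagnosis, scanner.Text())
				}
			}
		}
	}
}

Keeping the checks in a flat pattern-to-diagnosis table would make it cheap to grow the set incrementally as new failure modes show up in CI or the wild.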

@chancez
Contributor

chancez commented Oct 30, 2019

@abhinavdahiya recently dug into some CI logs and found what seems to be a boot-loop-related issue. It sounds like determining that required looking at multiple log entries, though.

@pecameron

When the installer fails and the failure has to do with something external to the cluster, the installer should report what went wrong and what the user needs to do to resolve it. Typical causes are resource problems in the cloud, such as an exceeded quota, not enough hosts available, or a bad/expired/missing token. Internal failures are OpenShift bugs that need to be fixed, not just reported by the installer.
I am thinking of something like the messages from "git rebase" or similar.

@sdodson
Member

sdodson commented Jan 10, 2020

@wking Where do you think these tools should incubate? In this repo's hack directory? A subcommand like #2567?

@wking
Member Author

wking commented Jan 10, 2020

I like having them in a subcommand so we can run them automatically on behalf of the user (who may have gotten the installer binary without our associated hack dir, which is the case for non-UPI CI jobs). But if folks feel that is too much of a risk, I'd be OK landing them outside the installer binary as a stopgap, ideally in a separate Go command, to make it easier to compile them into the core installer binary if we decide to go that way in the future.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Apr 9, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label May 9, 2020
@wking
Member Author

wking commented May 29, 2020

If the only thing holding this up is a desire to one-up systemd on summarizing failures, I'm happy to add that to #2567. Is that really all that we need to unblock this? There's also some contention here between "don't spew unsummarized failures at users" and "sometimes the installer will not have a compact summary". So in order of increased utility:

  1. Not gathering anything. Users can SSH in and look if they want. This is where we started.
  2. Automatically gathering log tarballs. This is where we are today.
  3. Automatically printing anything about the gathered tarball that directs attention to some subset of its content. This is where #2567 (cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs) would take us for failing-unit cases.
  4. Automatically summarizing a diagnosis in a line or two with a neat little bow. This is certainly worth it for common cases, but seems unobtainable for every possible failure mode.

I don't think getting to 4 should block us from moving to 3, but I'm happy to take #2567's failing-unit output to 4 before it lands if you have ideas about how you'd like those failing units summarized.

@stbenjam
Member

stbenjam commented Jun 3, 2020

What's needed to get #2567 rebased and landed? I think it's valuable and would be a good foundation for us to address some baremetal-specific things (see openshift/enhancements#328).

I agree 4 can come later, but if we wanted one neat little bow, we might print something useful if we notice the release-image service failed.

@stbenjam
Member

stbenjam commented Jun 3, 2020

/lifecycle frozen

@openshift-ci-robot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/rotten label Jun 3, 2020
@abhinavdahiya
Contributor

If the only thing holding this up is a desire to one-up systemd on summarizing failures, I'm happy to add that to #2567. Is that really all that we need to unblock this? There's also some contention here between "don't spew unsummarized failures at users" and "sometimes the installer will not have a compact summary". So in order of increased utility:

1. Not gathering anything.  Users can SSH in and look if they want.  This is where we started.

2. Automatically gathering log tarballs.  This is where we are today.

3. Automatically printing anything about the gathered tarball that directs attention to some subset of its content.  This is where #2567 would take us for failing-unit cases.

4. Automatically summarizing a diagnosis in a line or two with a neat little bow.  This is certainly worth it for common cases, but seems unobtainable for every possible failure mode.

I don't think getting to 4 should block us from moving to 3, but I'm happy to take #2567's failing-unit output to 4 before it lands if you have ideas about how you'd like those failing units summarized.

Currently, summarizing the cluster install failure does (3), i.e. it shows all the conditions of the cluster operators. Looking at all the BZs assigned to the installer and the people asking in Slack why the installer failed shows that (3) has added no major value in directing users to the correct operators.

So your point that (4) shouldn't block us from doing (3) is, IMO, not very useful; I think we need to target (4).
The goal should be to provide the user an exact, actionable error, so that the user is not looking through all the possible things to understand where the error is. Even if we can only capture/identify a few errors, those are better than printing a bunch of information that could reflect many different errors.

@wking
Member Author

wking commented Jun 11, 2020

Even if we can only capture/identify a few errors, those are better than printing a bunch of information that could reflect many different errors.

Want to pick a first thing? Like "Failed to fetch the release image"?
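
If "Failed to fetch the release image" were that first thing, a single-purpose check might look roughly like the sketch below. The journal text it matches and the wording of the hint are assumptions for illustration, not what the installer or #2567 actually emit:

package analyze

import "strings"

// diagnosis pairs a one-line summary with an actionable hint: the "exact,
// actionable error" discussed above.  The shape is illustrative only.
type diagnosis struct {
	Summary     string
	Remediation string
}

// checkReleaseImage is a hypothetical single-case check: given the text of
// the release-image service's journal, it returns a specific diagnosis, or
// nil if nothing matched.  The matched string is an assumed example of a
// registry authentication failure, not a string the installer is known to log.
func checkReleaseImage(journal string) *diagnosis {
	if strings.Contains(journal, "unauthorized: authentication required") {
		return &diagnosis{
			Summary:     "Failed to fetch the release image: the registry rejected the provided pull secret.",
			Remediation: "Verify that the pull secret in install-config.yaml grants access to the release-image registry.",
		}
	}
	return nil
}

Starting with one narrow check like this keeps the output specific, and further checks can be added one failure mode at a time.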
