RFE: Automatically analyze gathered bootstrap logs? #2569

Open
wking opened this issue Oct 24, 2019 · 13 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@wking
Member

wking commented Oct 24, 2019

When bootstrapping fails but we got far enough in to be able to SSH into the bootstrap machine, the installer automatically gathers a log tarball. But the tarball is probably fairly intimidating to users who aren't on the install team or in a position to be frequent debuggers. And in some cases, the results are sufficiently structured that we can point out a specific problem, e.g. "CRI-O failed" (#2567) or "you had insufficient creds to pull the release image" (#901). Do we want to teach the installer how to find and highlight some of those cases? We have a fair number of them in the last 24 hours of CI:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=-e2e-&maxAge=24h&context=0&search=Bootstrap+gather+logs+captured' | jq -r '. | keys[]' | sed 's|.*-e2e-\([^/]*\)/[0-9]*$|\1|' | sort | uniq -c | sort -n
      1 aws-console-olm
      1 azure-master
      1 azure-upgrade-4.2
      1 gcp-console
      1 gcp-upgrade
      1 libvirt
      1 openstack
      1 vsphere
      1 vsphere-upi-serial-4.2
      2 aws-encryption
      2 aws-ovn-kubernetes
      2 aws-ovs-kubernetes-4.3
      2 aws-upgrade
      2 aws-upi-4.3
      2 azure
      2 cmd
      2 tools
      2 wmcb
      2 wsu
      3 gcp
      3 gcp-op
      4 aws-proxy-4.2
      4 aws-proxy-4.3
      4 aws-scaleup-rhel7
      8 aws
     12 vsphere-upi-4.3
     12 vsphere-upi-serial-4.3
     20 aws-4.3

But there is currently no way to break those down by underlying cause.

@abhinavdahiya
Contributor

I like the idea. Do we have an idea regarding:

a) how do we want to expose this to users?
b) do we have a list of initial things we should highlight using this?
c) "But there is currently no way to break those down by underlying cause." So what kind of fingerprints do you think will be useful here?

@wking
Member Author

wking commented Oct 28, 2019

a) how do we want to expose this to users?

Log entries, like in #2567.

b) do we have a list of initial things we should highlight using this?

systemd unit failures, like in #2567. Also, maybe "you had insufficient creds to pull the release image"? Or "your control plane machines never formed an etcd cluster", or "I can't even SSH into your control plane machines". I don't think we need an exhaustive set of things; we can grow this incrementally as we run into issues in CI or the wild.

c) "But there is currently no way to break those down by underlying cause." So what kind of fingerprints do you think will be useful here?

Anytime someone hits a gather and says "I dunno, installer folks, but here's your gathered tarball", I think we should ask whether there is something we could either fix so it doesn't happen again or, when that's not possible (e.g. the user provides an insufficient pull secret), log a more approachable summary of the underlying issue. The tarball alone is usually going to be sufficient for debugging, but I think we'll have fewer installer-targeted tickets if we provide summaries where we can.
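
As a rough illustration of the fingerprint idea above, a minimal sketch of a pass over a gathered tarball might look like the following Go program. The patterns, the assumption that the bundle is a plain .tar.gz of text logs, and the output format are all placeholders, not the installer's actual layout or #2567's implementation:

// analyze_bundle.go: a minimal sketch of fingerprint-based analysis of a
// gathered bootstrap log tarball.  The patterns and the assumed bundle
// layout are illustrative, not the installer's actual format.
package main

import (
	"archive/tar"
	"bufio"
	"compress/gzip"
	"fmt"
	"os"
	"regexp"
)

// fingerprints map a human-readable diagnosis to a pattern that might
// appear in the gathered journal logs.  Both sides are hypothetical.
var fingerprints = map[string]*regexp.Regexp{
	"a systemd unit on the bootstrap machine failed":     regexp.MustCompile(`Failed with result 'exit-code'`),
	"insufficient credentials to pull the release image": regexp.MustCompile(`unauthorized: authentication required`),
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: analyze_bundle <log-bundle.tar.gz>")
		os.Exit(2)
	}

	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	tr := tar.NewReader(gz)

	// Walk every regular file in the bundle and report each line that matches
	// a fingerprint, so the user gets a pointer into the tarball rather than
	// having to read all of it.
	for {
		hdr, err := tr.Next()
		if err != nil {
			break // io.EOF, or a read error; either way, stop scanning
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		scanner := bufio.NewScanner(tr)
		for scanner.Scan() {
			for diagnosis, pattern := range fingerprints {
				if pattern.MatchString(scanner.Text()) {
					fmt.Printf("%s: %s\n    %s\n", hdr.Name, diagnosis, scanner.Text())
				}
			}
		}
	}
}

Keeping the checks in a flat pattern-to-diagnosis table would make it cheap to grow the set incrementally as new failure modes show up in CI or the wild.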

@chancez
Contributor

chancez commented Oct 30, 2019

@abhinavdahiya recently dug into some CI logs and found what seems to be a boot-loop-related issue. It sounds like determining that required looking at multiple log entries, though.

@pecameron

When the installer fails and the failure has to do with something external to the cluster, the installer should report what went wrong and what the user needs to do to resolve it. Typical causes are resource problems in the cloud, such as an exceeded quota, not enough hosts available, or a bad/expired/missing token. Internal failures are OpenShift bugs that need to be fixed, not just reported by the installer.
I am thinking of something like the messages from "git rebase" or similar.

@sdodson
Member

sdodson commented Jan 10, 2020

@wking Where do you think these tools should incubate? In this repo's hack directory? A subcommand like #2567?

@wking
Member Author

wking commented Jan 10, 2020

I like having them in a subcommand so we can run them automatically on behalf of the user (who may have gotten the installer binary without our associated hack dir, which is the case for non-UPI CI jobs). But if folks feel that is too much of a risk, I'd be OK landing them outside the installer binary as a stopgap, ideally in a separate Go command, to make it easier to compile them into the core installer binary if we decide to go that way in the future.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Apr 9, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label May 9, 2020
@wking
Member Author

wking commented May 29, 2020

If the only thing holding this up is a desire to one-up systemd on summarizing failures, I'm happy to add that to #2567. Is that really all that we need to unblock this? There's also some contention here between "don't spew unsummarized failures at users" and "sometimes the installer will not have a compact summary". So in order of increased utility:

  1. Not gathering anything. Users can SSH in and look if they want. This is where we started.
  2. Automatically gathering log tarballs. This is where we are today.
  3. Automatically printing anything about the gathered tarball that directs attention to some subset of its content. This is where #2567 (cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs) would take us for failing-unit cases.
  4. Automatically summarizing a diagnosis in a line or two with a neat little bow. This is certainly worth it for common cases, but seems unobtainable for every possible failure mode.

I don't think getting to 4 should block us from moving to 3, but I'm happy to take #2567's failing-unit output to 4 before it lands if you have ideas about how you'd like those failing units summarized.

@stbenjam
Member

stbenjam commented Jun 3, 2020

What's needed to get #2567 rebased and landed? I think it's valuable and would be a good foundation for us to address some baremetal-specific things (see openshift/enhancements#328).

I agree 4 can come later, but if we wanted one neat little bow, we might print something useful if we notice the release-image service failed.

@stbenjam
Member

stbenjam commented Jun 3, 2020

/lifecycle frozen

@openshift-ci-robot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/rotten label Jun 3, 2020
@abhinavdahiya
Contributor

If the only thing holding this up is a desire to one-up systemd on summarizing failures, I'm happy to add that to #2567. Is that really all that we need to unblock this? There's also some contention here between "don't spew unsummarized failures at users" and "sometimes the installer will not have a compact summary". So in order of increased utility:

1. Not gathering anything.  Users can SSH in and look if they want.  This is where we started.

2. Automatically gathering log tarballs.  This is where we are today.

3. Automatically printing anything about the gathered tarball that directs attention to some subset of its content.  This is where #2567 would take us for failing-unit cases.

4. Automatically summarizing a diagnosis in a line or two with a neat little bow.  This is certainly worth it for common cases, but seems unobtainable for every possible failure mode.

I don't think getting to 4 should block us from moving to 3, but I'm happy to take #2567's failing-unit output to 4 before it lands if you have ideas about how you'd like those failing units summarized.

Currently, summarizing the cluster install failure does (3), i.e. it shows all the conditions of the cluster operators. Looking at all the BZs assigned to the installer and the people asking in Slack why the installer failed shows that (3) has added no major value in directing users to the correct operators.

So your point that (4) shouldn't block us from doing (3) is, IMO, not very useful; I think we need to target (4).
The goal should be to provide the user an exact, actionable error, so that the user is not looking through all the possible things to understand where the error is. Even if we can only capture/identify a few errors, those are better than printing a bunch of information that could reflect many different errors.

@wking
Member Author

wking commented Jun 11, 2020

Even if we can only capture/identify a few errors, those are better than printing a bunch of information that could reflect many different errors.

Want to pick a first thing? Like "Failed to fetch the release image"?
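
If "Failed to fetch the release image" were that first thing, a single-purpose check might look roughly like the sketch below. The journal text it matches and the wording of the hint are assumptions for illustration, not what the installer or #2567 actually emit:

package analyze

import "strings"

// diagnosis pairs a one-line summary with an actionable hint: the "exact,
// actionable error" discussed above.  The shape is illustrative only.
type diagnosis struct {
	Summary     string
	Remediation string
}

// checkReleaseImage is a hypothetical single-case check: given the text of
// the release-image service's journal, it returns a specific diagnosis, or
// nil if nothing matched.  The matched string is an assumed example of a
// registry authentication failure, not a string the installer is known to log.
func checkReleaseImage(journal string) *diagnosis {
	if strings.Contains(journal, "unauthorized: authentication required") {
		return &diagnosis{
			Summary:     "Failed to fetch the release image: the registry rejected the provided pull secret.",
			Remediation: "Verify that the pull secret in install-config.yaml grants access to the release-image registry.",
		}
	}
	return nil
}

Starting with one narrow check like this keeps the output specific, and further checks can be added one failure mode at a time.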
