Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/baremetal: add additional troubleshooting information #2665

Merged

Conversation

stbenjam
Copy link
Member

baremetal has additional troubleshooting steps related to the use of a
libvirt bootstrap VM that hosts an instance of Ironic. This adds some
additional information beyond the normal OpenShift troubleshooting
material.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 13, 2019
@stbenjam stbenjam closed this Nov 13, 2019
@stbenjam stbenjam reopened this Nov 13, 2019
@stbenjam
Copy link
Member Author

/cc @davidvossel

@stbenjam stbenjam force-pushed the troubleshooting-docs branch 3 times, most recently from 3ddad30 to f06e24b Compare December 11, 2019 15:18
You may want to examine Ironic itself and look at the state of the
hosts. The below file, when named clouds.yaml and placed in the current
working directory, can be used to communicate with Ironic using the
openstack commandline utilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably reference some ways to install this client, or else revive openshift/ironic-image#17

IMO it'd probably be better to show a way to run it via a container, since we know not all downstream consumers of the installer will have access to repos with openstack RPMs, and they probably won't want to pip install the clients?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we defer this until later until we have a better mechanism? I think distributing it in a container is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think deferring and iterating is fine, so LGTM.

I'm actually wondering how much info we want to include about getting to ironic. I would consider it a bug if a user had to connect to ironic to understand what's going on in the installation process. I would expect the installer to report the details that might be useful in its logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree. #2679, #2009

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack agreed, lets defer it - maybe we can work out part of the gather workflow that retrieves the ironic data needed for debugging

@stbenjam
Copy link
Member Author

@dhellmann @hardys Could you take another look? Thanks!

@hardys
Copy link
Contributor

hardys commented Dec 12, 2019

/approve

baremetal has additional troubleshooting steps related to the use of a
libvirt bootstrap VM that hosts an instance of Ironic. This adds some
additional information beyond the normal OpenShift troubleshooting
material.
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hardys, yprokule

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2019
@hardys
Copy link
Contributor

hardys commented Dec 18, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2019
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Comment on lines +227 to +235
You can view the Ironic logs by sshing to the bootstrap VM, and
examining the logs of the `ironic` service, `journalctl -u ironic`. You
may also view the logs of the individual containers:

- `podman logs ipa-downloader`
- `podman logs coreos-downloader`
- `podman logs ironic`
- `podman logs ironic-inspector`
- `podman logs ironic-dnsmasq`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the gather bootstrap already captures all the logs from pods, containers and other things if the bootstrapping fails.
I think it is more helpful to provide troubleshooting documents using the standard patterns instead of creating new steps of SSHing in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, install gather does not work on baremetal yet (#2009), I have a branch in progress to do it, but some things aren't complete as we need to be able to get the IP's for the control plane.

Also, at least on baremetal platform the way things are today, there is a lot of value of ssh'ing to the bootstrap to review the individual logs of Ironic to troubleshoot what's gone wrong, and understanding all the containers that are running. I think it should stay, and remove it when #2009 is addressed.

And, since we're talking about install gather, I discovered some things in the progress of working on support in baremetal and broke some clean-ups out into their own PR which needs review
#2698

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, install gather does not work on baremetal yet (#2009), I have a branch in progress to do it, but some things aren't complete as we need to be able to get the IP's for the control plane.

Also, at least on baremetal platform the way things are today, there is a lot of value of ssh'ing to the bootstrap to review the individual logs of Ironic to troubleshoot what's gone wrong, and understanding all the containers that are running. I think it should stay, and remove it when #2009 is addressed.

This useless to us when we debug our CI runs, because we can't SSH while the test is running, so I don't think there is value in letting users so that too.

Making sure you can trace the bugs or what went wrong from gathered logs after failure is important requirement. and forcing it makes sure we are not loosing logs from containers that are only accessible when running.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do get it, I understand the importance of gathering the information in the right way, but I also waste a lot of my day giving these exact troubleshooting steps to people on Slack.

If you want to revert this, I'll just move it somewhere else as it has to exist somewhere. I'd really rather it stay until I can give you install-gather support for the platform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do get it, I understand the importance of gathering the information in the right way, but I also waste a lot of my day giving these exact troubleshooting steps to people on Slack.

If you want to revert this, I'll just move it somewhere else as it has to exist somewhere. I'd really rather it stay until I can give you install-gather support for the platform.

I wasn't making the case for reverting, rather documenting running the gather command specifying the ip addresses for bootstrap and control-plane machines manually for now and then documenting how to scrub and look for failure in the gather log-bundle.

Also, I don't think this needs to belong anywhere else, it is good in the installer repo, more so not documenting debugging workflows that are not useful debugging CI runs.

@abhinavdahiya
Copy link
Contributor

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2019
@openshift-merge-robot openshift-merge-robot merged commit e8289c5 into openshift:master Dec 18, 2019
@stbenjam
Copy link
Member Author

Looks like you were too late, I'm happy to update though in a follow-up, but see my reply above.

@openshift-ci-robot
Copy link
Contributor

@stbenjam: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-fips d9bd457 link /test e2e-aws-fips

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stbenjam stbenjam deleted the troubleshooting-docs branch December 18, 2019 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants