docs/baremetal: add additional troubleshooting information #2665
Conversation
/cc @davidvossel
Force-pushed 87aaac3 to 8fc52ea
Force-pushed 3ddad30 to f06e24b
You may want to examine Ironic itself and look at the state of the
hosts. The file below, when named clouds.yaml and placed in the current
working directory, can be used to communicate with Ironic using the
openstack command-line utilities.
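As an illustration of the kind of file the doc is describing, a minimal sketch of such a clouds.yaml might look like the following. The cloud name, the bootstrap IP placeholder, and the exact endpoint keys are assumptions for illustration; 6385 and 5050 are the default Ironic and Ironic Inspector API ports, and the bootstrap services here run without authentication:

```yaml
clouds:
  metal3-bootstrap:
    # The bootstrap Ironic runs with authentication disabled.
    auth_type: none
    # Replace <bootstrap-ip> with the bootstrap VM's IP address.
    baremetal_endpoint_override: http://<bootstrap-ip>:6385
    baremetal_introspection_endpoint_override: http://<bootstrap-ip>:5050
```

With this file in the current working directory, something like `export OS_CLOUD=metal3-bootstrap` followed by `openstack baremetal node list` should let you inspect the provisioning state of the hosts.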
We should probably reference some ways to install this client, or else revive openshift/ironic-image#17
IMO it'd probably be better to show a way to run it via a container, since we know not all downstream consumers of the installer will have access to repos with openstack RPMs, and they probably won't want to pip install the clients?
Could we defer this until we have a better mechanism? I think distributing it in a container is a good idea.
I think deferring and iterating is fine, so LGTM.
I'm actually wondering how much info we want to include about getting to ironic. I would consider it a bug if a user had to connect to ironic to understand what's going on in the installation process. I would expect the installer to report the details that might be useful in its logs.
Ack, agreed, let's defer it. Maybe we can work out a part of the gather workflow that retrieves the Ironic data needed for debugging.
Force-pushed f06e24b to 697e803
@dhellmann @hardys Could you take another look? Thanks!
/approve
The baremetal platform has additional troubleshooting steps related to the use of a libvirt bootstrap VM that hosts an instance of Ironic. This adds some additional information beyond the normal OpenShift troubleshooting material.
Force-pushed 697e803 to d9bd457
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hardys, yprokule. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
You can view the Ironic logs by SSHing to the bootstrap VM and
examining the logs of the `ironic` service with `journalctl -u ironic`. You
may also view the logs of the individual containers:

- `podman logs ipa-downloader`
- `podman logs coreos-downloader`
- `podman logs ironic`
- `podman logs ironic-inspector`
- `podman logs ironic-dnsmasq`
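As a rough sketch of the workflow the doc describes (the bootstrap IP is a placeholder, and `core` is the usual login user on a CoreOS-based bootstrap host):

```sh
# SSH to the bootstrap VM as the default CoreOS user
ssh core@<bootstrap-ip>

# List all containers, including any that have exited,
# to discover which services ran on the bootstrap host
sudo podman ps -a

# Follow the ironic service logs for the current boot
journalctl -b -u ironic -f
```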
The gather bootstrap already captures all the logs from pods, containers, and other things if the bootstrapping fails.
I think it is more helpful to provide troubleshooting documentation using the standard patterns instead of creating new steps that require SSHing in.
Well, install gather does not work on baremetal yet (#2009). I have a branch in progress to do it, but some things aren't complete, as we need to be able to get the IPs for the control plane.
Also, at least on the baremetal platform as things are today, there is a lot of value in SSHing to the bootstrap to review the individual logs of Ironic to troubleshoot what's gone wrong, and to understand all the containers that are running. I think it should stay, and be removed when #2009 is addressed.
And, since we're talking about install gather, I discovered some things in the progress of working on support in baremetal and broke some clean-ups out into their own PR which needs review
#2698
Well, install gather does not work on baremetal yet (#2009). I have a branch in progress to do it, but some things aren't complete, as we need to be able to get the IPs for the control plane.
Also, at least on the baremetal platform as things are today, there is a lot of value in SSHing to the bootstrap to review the individual logs of Ironic to troubleshoot what's gone wrong, and to understand all the containers that are running. I think it should stay, and be removed when #2009 is addressed.
This is useless to us when we debug our CI runs, because we can't SSH while the test is running, so I don't think there is value in letting users do that either.
Making sure you can trace the bugs or what went wrong from gathered logs after a failure is an important requirement, and forcing it makes sure we are not losing logs from containers that are only accessible while running.
I really do get it; I understand the importance of gathering the information in the right way, but I also waste a lot of my day giving these exact troubleshooting steps to people on Slack.
If you want to revert this, I'll just move it somewhere else, as it has to exist somewhere. I'd really rather it stay until I can give you install-gather support for the platform.
I really do get it; I understand the importance of gathering the information in the right way, but I also waste a lot of my day giving these exact troubleshooting steps to people on Slack.
If you want to revert this, I'll just move it somewhere else, as it has to exist somewhere. I'd really rather it stay until I can give you install-gather support for the platform.
I wasn't making the case for reverting, but rather for documenting how to run the gather command, specifying the IP addresses for the bootstrap and control-plane machines manually for now, and then documenting how to scrub the gathered log-bundle and look for failures.
Also, I don't think this needs to belong anywhere else; it is good in the installer repo. Even more reason not to document debugging workflows that are not useful for debugging CI runs.
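For reference, the manual invocation being discussed, with the addresses supplied by hand, looks roughly like the following (the IP addresses are placeholders, and the number of `--master` flags depends on the size of the control plane):

```sh
openshift-install gather bootstrap \
  --bootstrap <bootstrap-ip> \
  --master <master-0-ip> \
  --master <master-1-ip> \
  --master <master-2-ip>
```

This produces a log bundle that can be inspected offline for bootstrap and control-plane failures.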
/lgtm cancel
Looks like you were too late. I'm happy to update in a follow-up, though, but see my reply above.
@stbenjam: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.