baremetal: master deployment can fail due to ironic-conductor not ready #2880
Comments
/label platform/barmetal |
/label platform/baremetal |
Hm, I had a discussion with either @juliakreger or @dtantsur back then and openshift-metal3/terraform-provider-ironic#32 was the outcome of that. Is there a better way to know conductor is fully up, without having to know all the drivers we're waiting for? |
I did also look at the "Alive" flag from This seems like a design flaw in Ironic really, but @dhellmann suggested we could perhaps wait for the fake driver and at least one more (since we always expect some additional drivers), then wait until the number of drivers is stable. How long to wait is TBC, but loading the drivers should be fairly fast, so waiting a few seconds after we get the first non-fake one should be OK. Having an explicit flag from the API, or having the Ironic API return 503 when there is a single conductor and it is still starting up, would of course be preferable. |
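The heuristic suggested above (wait for the fake driver plus at least one more, then wait until the list stops changing) could be sketched roughly as follows. This is only an illustration, not what terraform-provider-ironic does: the function name, the `fetch_drivers` callable and the default timings are all hypothetical, and `fetch_drivers` is assumed to return the driver names currently reported by `/v1/drivers`.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_stable_drivers(
    fetch_drivers: Callable[[], Iterable[str]],
    settle_seconds: float = 5.0,      # hypothetical "few seconds" settle window
    timeout_seconds: float = 120.0,   # hypothetical overall timeout
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Return the driver set once it holds fake-hardware plus at least one
    other driver and has not changed for settle_seconds."""
    deadline = clock() + timeout_seconds
    last_seen = None
    stable_since = clock()
    while clock() < deadline:
        current = set(fetch_drivers())
        # Ready means: the fake driver and at least one real driver are listed.
        ready = "fake-hardware" in current and len(current) >= 2
        if ready and current == last_seen:
            if clock() - stable_since >= settle_seconds:
                return current
        else:
            # Either not ready yet or the list changed: restart the settle window.
            last_seen = current
            stable_since = clock()
        sleep(poll_interval)
    raise TimeoutError("driver list did not stabilise before the timeout")
```

In a real deployment `fetch_drivers` would wrap a request to the Ironic API on the bootstrap VM. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. Note that a settle window narrows the race but does not remove it if the DB is slow enough.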
Possible workaround in the image ref metal3-io/ironic-image#122 |
I've observed the list returned by /v1/drivers as empty, as containing 3 results, and in all stages in between (over less than a second). I can see in the ironic DB a row for the conductor being created once, and not getting updated,
then a bunch of rows get added to conductor_hardware_interfaces, individually,
which I think correspond to I believe that even though the conductor is registered with the 3 hardware_types, some don't get returned until each one is added to conductor_hardware_interfaces. The likelihood of hitting this race condition is slim, but we've also observed that DB operations (sync) in the environment in question are very slow, which probably widens the window in which we can hit the problem. Ultimately I think we have two problems here
I'm not sure this will work, as it checks the conductor table and the drivers will not yet have been added to the conductor_hardware_interfaces table. |
/lifecycle frozen |
This is an important issue that needs fixing, even if it's not something we see very often. |
A fix was available at openshift/baremetal-operator@d6eaf67 |
Verified that this is solved on the OCP 4.5.7 stable branch. |
Closing as this has been reported as fixed. /close |
@stbenjam: Closing this issue. In response to this:
It has been observed that in some environments we can attempt to deploy masters while the conductor service on the bootstrap VM is still starting up. This results in a failure like:
Looking at the bootstrap VM ironic logs we see:
then a few seconds later the conductor has finished starting
A partial fix for this already landed in terraform-provider-ironic (ref openshift-metal3/terraform-provider-ironic#32), however it doesn't wait long enough for all drivers to be loaded:
The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; ipmi is only added to the list later (see how the len of the replies goes up to 1641 after the error). Thanks to @derekhiggins for helping to figure this out.
We need some way to wait until the conductor is fully initialized, and all enabled drivers are ready for use, before attempting to deploy any master nodes.
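One way to express that requirement, assuming the caller knows which drivers are enabled, is to poll until every expected name is listed rather than accepting the first non-empty reply. The sketch below is illustrative only: `wait_for_enabled_drivers`, `driver_names` and the `fetch_payload` callable are hypothetical names, and the response is assumed to have the shape `{"drivers": [{"name": ...}, ...]}`.

```python
import time
from typing import Callable, Iterable, List


def driver_names(payload: dict) -> List[str]:
    """Extract driver names from a decoded /v1/drivers response body.

    Assumes the body looks like {"drivers": [{"name": "ipmi", ...}, ...]}.
    """
    return [entry["name"] for entry in payload.get("drivers", [])]


def wait_for_enabled_drivers(
    fetch_payload: Callable[[], dict],
    expected: Iterable[str],
    timeout_seconds: float = 120.0,   # hypothetical default
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until every expected driver is listed, or raise TimeoutError."""
    wanted = set(expected)
    deadline = clock() + timeout_seconds
    while True:
        missing = wanted - set(driver_names(fetch_payload()))
        if not missing:
            return
        if clock() >= deadline:
            raise TimeoutError(f"drivers still missing: {sorted(missing)}")
        sleep(poll_interval)
```

For the failure described here, a call such as `wait_for_enabled_drivers(fetch, ["fake-hardware", "ipmi"])` would keep polling while only "fake-hardware" is listed. The trade-off is that the caller must know the enabled driver list up front.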