
baremetal: master deployment can fail due to ironic-conductor not ready #2880

Closed
hardys opened this issue Jan 6, 2020 · 16 comments
Labels
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
platform/baremetal: IPI bare metal hosts platform

Comments


hardys commented Jan 6, 2020

It has been observed that in some environments we can attempt to deploy masters while the conductor service on the bootstrap VM is still starting up, which results in a failure like:

msg="module.masters.ironic_node_v1.openshift-master-host[0]: Still creating... [22m30s elapsed]"
level=error
level=error msg="Error: Bad request with: [POST http://172.22.0.2:6385/v1/nodes], error message: {\"error_message\": \"{\\\"faultcode\\\": \\\"Client\\\", \\\"faultstring\\\": \\\"No valid host was found. Reason: No conductor service registered which supports driver ipmi for conductor group \\\\\\\"\\\\\\\".\\\", \\\"debuginfo\\\": null}\"}"
level=error
level=error msg="  on ../../../../../tmp/openshift-install-968715703/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"

Looking at the bootstrap VM ironic logs we see:

2020-01-06 09:51:46.400 26 DEBUG ironic.common.hash_ring [req-fc6d508c-e93b-4567-ba20-70d34f789558 - - - - -] No conductor from group <none> found for driver ipmi, trying to rebuild the hash rings get_ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:91[00m

then a few seconds later the conductor finishes starting:

2020-01-06 09:51:57.347 1 INFO ironic.conductor.base_manager [req-691df76f-fd57-4c5a-bcc6-3d5d5a86dd68 - - - - -] Successfully started conductor with hostname dhcp19-232-242...

A partial fix for this already landed in terraform-provider-ironic (ref openshift-metal3/terraform-provider-ironic#32); however, this doesn't wait long enough for all drivers to be loaded:

09:51:36.343 26 "GET /v1/drivers HTTP/1.1" len: 335
09:51:41.365 25 "GET /v1/drivers HTTP/1.1" len: 335
09:51:46.387 26 "GET /v1/drivers HTTP/1.1" len: 796

09:51:46.400 26 No conductor from group <none> found for driver ipmi, trying to rebuild the hash rings
09:51:46.402 26 Finished rebuilding hash rings, available drivers are :fake-hardware ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:62[00m
09:51:46.403 26 No conductor service registered which supports driver ipmi for conductor group "".

The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out):

10:22:54.929 26 "GET /v1/drivers HTTP/1.1" len: 1641

We need some way to wait until the conductor is fully initialized, and all enabled drivers are ready for use, before attempting to deploy any master nodes.
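
For illustration, here's a minimal sketch in Go (not the actual terraform-provider-ironic code) of a check that would dodge this particular race by waiting for a specific driver to show up in /v1/drivers. The response shape follows the Ironic API; the endpoint URL, poll interval, and timeout are illustrative assumptions. Its obvious drawback, discussed below, is that the caller has to know which drivers to expect:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// listDrivers returns the driver names currently reported by Ironic's
// GET /v1/drivers endpoint.
func listDrivers(ironicURL string) ([]string, error) {
	resp, err := http.Get(ironicURL + "/v1/drivers")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Drivers []struct {
			Name string `json:"name"`
		} `json:"drivers"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(out.Drivers))
	for _, d := range out.Drivers {
		names = append(names, d.Name)
	}
	return names, nil
}

// waitForDriver polls until the named driver appears or the timeout expires.
func waitForDriver(ironicURL, driver string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if names, err := listDrivers(ironicURL); err == nil {
			for _, n := range names {
				if n == driver {
					return nil
				}
			}
		}
		time.Sleep(5 * time.Second) // poll interval: an illustrative guess
	}
	return fmt.Errorf("driver %q not registered within %s", driver, timeout)
}

func main() {
	if err := waitForDriver("http://172.22.0.2:6385", "ipmi", 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```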


hardys commented Jan 6, 2020

/label platform/barmetal

@openshift-ci-robot

@hardys: The label(s) /label platform/barmetal cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga

In response to this:

> /label platform/barmetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


hardys commented Jan 6, 2020

/label platform/baremetal

openshift-ci-robot added the platform/baremetal label Jan 6, 2020

stbenjam commented Jan 6, 2020

> The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out).

Hm, I had a discussion with either @juliakreger or @dtantsur back then, and openshift-metal3/terraform-provider-ironic#32 was the outcome of that. Is there a better way to know the conductor is fully up, without having to know all the drivers we're waiting for?


hardys commented Jan 6, 2020

> The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out).

> Hm, I had a discussion with either @juliakreger or @dtantsur back then, and openshift-metal3/terraform-provider-ironic#32 was the outcome of that. Is there a better way to know the conductor is fully up, without having to know all the drivers we're waiting for?

I did also look at the "Alive" flag from openstack baremetal conductor list, but AFAICS this is just a liveness check on the updated_at field in the DB, e.g. useful to see if a conductor died, but not to tell whether it's still initializing.

This seems like a design flaw in ironic really, but @dhellmann suggested we could perhaps wait for the fake driver and at least one more (since we always expect some additional drivers), then wait until the number is stable for some time (how long is TBD, but loading drivers should be fairly fast, so waiting a few seconds after we get the first non-fake one should be OK?).

Having an explicit flag from the API, or having the ironic API 503 in the event there's a single conductor and it's still starting up would be preferable of course.
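
A rough sketch of that heuristic, reusing the hypothetical listDrivers helper from the sketch above (the 2s poll interval and the three-unchanged-polls threshold are guesses, not tested values):

```go
// waitForStableDrivers treats the conductor as ready once /v1/drivers
// reports fake-hardware plus at least one real driver, and the driver
// count has then stayed unchanged for three consecutive polls.
func waitForStableDrivers(ironicURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	lastCount, stablePolls := -1, 0
	for time.Now().Before(deadline) {
		names, err := listDrivers(ironicURL)
		if err == nil {
			hasFake := false
			for _, n := range names {
				if n == "fake-hardware" {
					hasFake = true
				}
			}
			if hasFake && len(names) >= 2 { // fake driver plus at least one more
				if len(names) == lastCount {
					if stablePolls++; stablePolls >= 3 {
						return nil
					}
				} else {
					lastCount, stablePolls = len(names), 0
				}
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("driver list did not stabilize within %s", timeout)
}
```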


hardys commented Jan 6, 2020

Possible workaround in the image ref metal3-io/ironic-image#122


derekhiggins commented Jan 9, 2020

I've observed the list returned by /v1/drivers as empty, as containing 3 results, and at all stages in between (over less than a second).

I can see in the ironic DB a row for the conductor being created once, and not getting updated:

INSERT INTO conductors (created_at, updated_at, version, hostname, drivers, online, conductor_group) VALUES ('2020-01-07 16:38:44.992987', '2020-01-07 16:38:44.990759', '1.3', 'localhost.localdomain', '[\"fake-hardware\", \"idrac\", \"ipmi\"]', 1, '')

then a bunch of rows get added to conductor_hardware_interfaces, individually:

INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.011828', NULL, NULL, 1, 'fake-hardware', 'deploy', 'fake', 0)
INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.017763', NULL, NULL, 1, 'fake-hardware', 'console', 'no-console', 1)
INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.023438', NULL, NULL, 1, 'fake-hardware', 'inspect', 'fake', 0)

which I think correspond to
https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/base_manager.py#L149 and
https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/base_manager.py#L165

I believe that even though the conductor is registered with the 3 hardware_types, some don't get returned until each one is added to conductor_hardware_interfaces.

The likelihood of hitting this race condition is slim, but we've also observed that the DB operations (sync) on the environment in question are very slow, which is probably increasing the window in which we can hit the problem. Ultimately I think we have two problems here:

  1. we can't assume that just because /v1/drivers is returning items, all the drivers have loaded
  2. we have a slow DB (which exposed problem 1)

> Possible workaround in the image ref metal3-io/ironic-image#122

I'm not sure this will work, as it checks the conductors table, and the drivers will not yet have been added to the conductor_hardware_interfaces table.
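
For illustration, a sketch of the kind of stronger check this implies (not the actual ironic-image#122 change; the DSN, the MySQL driver import, and picking ipmi as the driver of interest are assumptions of the sketch):

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// driverInterfacesRegistered reports whether at least one interface row
// exists in conductor_hardware_interfaces for the given hardware type,
// which only happens after the conductor has finished loading that driver.
func driverInterfacesRegistered(dsn, hardwareType string) (bool, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return false, err
	}
	defer db.Close()

	var n int
	err = db.QueryRow(
		`SELECT COUNT(*) FROM conductor_hardware_interfaces WHERE hardware_type = ?`,
		hardwareType).Scan(&n)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	// DSN is a made-up example; the real credentials live in the ironic config.
	ok, err := driverInterfacesRegistered("ironic:password@tcp(localhost:3306)/ironic", "ipmi")
	fmt.Println(ok, err)
}
```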

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label Apr 8, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 8, 2020

stbenjam commented May 8, 2020

/lifecycle frozen

openshift-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label May 8, 2020

stbenjam commented May 8, 2020

This is an important issue that needs fixing, even if it's not something we see very often.


jparrill commented Aug 25, 2020

Facing this one on the OCP 4.5.6 stable branch under IPv6:

[screenshot attached]

Any workaround?

@andfasano

A fix was available at openshift/baremetal-operator@d6eaf67

@jparrill

Verified that this is solved on the OCP 4.5.7 stable branch.


stbenjam commented Nov 2, 2020

Closing as this has been reported as fixed.

/close

@openshift-ci-robot

@stbenjam: Closing this issue.

In response to this:

> Closing as this has been reported as fixed.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
