
baremetal: master deployment can fail due to ironic-conductor not ready #2880

Closed
hardys opened this issue Jan 6, 2020 · 16 comments
Labels
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
platform/baremetal: IPI bare metal hosts platform

Comments


hardys commented Jan 6, 2020

It has been observed that in some environments we can attempt to deploy masters while the conductor service on the bootstrap VM is still starting up, which results in a failure like:

msg="module.masters.ironic_node_v1.openshift-master-host[0]: Still creating... [22m30s elapsed]"
level=error
level=error msg="Error: Bad request with: [POST http://172.22.0.2:6385/v1/nodes], error message: {\"error_message\": \"{\\\"faultcode\\\": \\\"Client\\\", \\\"faultstring\\\": \\\"No valid host was found. Reason: No conductor service registered which supports driver ipmi for conductor group \\\\\\\"\\\\\\\".\\\", \\\"debuginfo\\\": null}\"}"
level=error
level=error msg="  on ../../../../../tmp/openshift-install-968715703/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"

Looking at the bootstrap VM ironic logs we see:

2020-01-06 09:51:46.400 26 DEBUG ironic.common.hash_ring [req-fc6d508c-e93b-4567-ba20-70d34f789558 - - - - -] No conductor from group <none> found for driver ipmi, trying to rebuild the hash rings get_ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:91[00m

then a few seconds later the conductor finishes starting:

2020-01-06 09:51:57.347 1 INFO ironic.conductor.base_manager [req-691df76f-fd57-4c5a-bcc6-3d5d5a86dd68 - - - - -] Successfully started conductor with hostname dhcp19-232-242...

A partial fix for this already landed in terraform-provider-ironic (ref openshift-metal3/terraform-provider-ironic#32); however, this doesn't wait long enough for all drivers to be loaded:

09:51:36.343 26 "GET /v1/drivers HTTP/1.1" len: 335
09:51:41.365 25 "GET /v1/drivers HTTP/1.1" len: 335
09:51:46.387 26 "GET /v1/drivers HTTP/1.1" len: 796

09:51:46.400 26 No conductor from group <none> found for driver ipmi, trying to rebuild the hash rings
09:51:46.402 26 Finished rebuilding hash rings, available drivers are :fake-hardware ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:62[00m
09:51:46.403 26 No conductor service registered which supports driver ipmi for conductor group "".

The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out):

10:22:54.929 26 "GET /v1/drivers HTTP/1.1" len: 1641

We need some way to wait until the conductor is fully initialized, and all enabled drivers are ready for use, before attempting to deploy any master nodes.
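
For illustration, here's a minimal sketch in Go (not the actual terraform-provider-ironic code) of a check that would dodge this particular race by waiting for a specific driver to show up in /v1/drivers. The response shape follows the Ironic API; the endpoint URL, poll interval, and timeout are illustrative assumptions. Its obvious drawback, discussed below, is that the caller has to know which drivers to expect:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// listDrivers returns the driver names currently reported by Ironic's
// GET /v1/drivers endpoint.
func listDrivers(ironicURL string) ([]string, error) {
	resp, err := http.Get(ironicURL + "/v1/drivers")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Drivers []struct {
			Name string `json:"name"`
		} `json:"drivers"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(out.Drivers))
	for _, d := range out.Drivers {
		names = append(names, d.Name)
	}
	return names, nil
}

// waitForDriver polls until the named driver appears or the timeout expires.
func waitForDriver(ironicURL, driver string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if names, err := listDrivers(ironicURL); err == nil {
			for _, n := range names {
				if n == driver {
					return nil
				}
			}
		}
		time.Sleep(5 * time.Second) // poll interval: an illustrative guess
	}
	return fmt.Errorf("driver %q not registered within %s", driver, timeout)
}

func main() {
	if err := waitForDriver("http://172.22.0.2:6385", "ipmi", 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```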


hardys commented Jan 6, 2020

/label platform/barmetal

@openshift-ci-robot

@hardys: The label(s) /label platform/barmetal cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga

In response to this:

> /label platform/barmetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


hardys commented Jan 6, 2020

/label platform/baremetal

openshift-ci-robot added the platform/baremetal label Jan 6, 2020

stbenjam commented Jan 6, 2020

> The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out).

Hm, I had a discussion with either @juliakreger or @dtantsur back then, and openshift-metal3/terraform-provider-ironic#32 was the outcome of that. Is there a better way to know the conductor is fully up, without having to know all the drivers we're waiting for?


hardys commented Jan 6, 2020

> The driver list contains "fake-hardware" but not ipmi, which satisfies the terraform check; only later is ipmi added to the list (the len of the replies goes up to 1641 after the error; thanks to @derekhiggins for helping to figure this out).

> Hm, I had a discussion with either @juliakreger or @dtantsur back then, and openshift-metal3/terraform-provider-ironic#32 was the outcome of that. Is there a better way to know the conductor is fully up, without having to know all the drivers we're waiting for?

I did also look at the "Alive" flag from openstack baremetal conductor list, but AFAICS this is just a liveness check on the updated_at field in the DB, e.g. useful to see if a conductor died, but not to tell whether it's still initializing.

This seems like a design flaw in ironic really, but @dhellmann suggested we could perhaps wait for the fake driver and at least one more (since we always expect some additional drivers), then wait until the number is stable for some time (how long is TBD, but loading drivers should be fairly fast, so waiting a few seconds after we get the first non-fake one should be OK?).

Having an explicit flag from the API, or having the ironic API 503 in the event there's a single conductor and it's still starting up would be preferable of course.
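
A rough sketch of that heuristic, reusing the hypothetical listDrivers helper from the sketch above (the 2s poll interval and the three-unchanged-polls threshold are guesses, not tested values):

```go
// waitForStableDrivers treats the conductor as ready once /v1/drivers
// reports fake-hardware plus at least one real driver, and the driver
// count has then stayed unchanged for three consecutive polls.
func waitForStableDrivers(ironicURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	lastCount, stablePolls := -1, 0
	for time.Now().Before(deadline) {
		names, err := listDrivers(ironicURL)
		if err == nil {
			hasFake := false
			for _, n := range names {
				if n == "fake-hardware" {
					hasFake = true
				}
			}
			if hasFake && len(names) >= 2 { // fake driver plus at least one more
				if len(names) == lastCount {
					if stablePolls++; stablePolls >= 3 {
						return nil
					}
				} else {
					lastCount, stablePolls = len(names), 0
				}
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("driver list did not stabilize within %s", timeout)
}
```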


hardys commented Jan 6, 2020

Possible workaround in the image ref metal3-io/ironic-image#122


derekhiggins commented Jan 9, 2020

I've observed the list returned by /v1/drivers as empty, as containing 3 results, and at all stages in between (over less than a second).

I can see in the ironic DB a row for the conductor being created once, and not getting updated:

INSERT INTO conductors (created_at, updated_at, version, hostname, drivers, online, conductor_group) VALUES ('2020-01-07 16:38:44.992987', '2020-01-07 16:38:44.990759', '1.3', 'localhost.localdomain', '[\"fake-hardware\", \"idrac\", \"ipmi\"]', 1, '')

then a bunch of rows get added to conductor_hardware_interfaces, individually:

INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.011828', NULL, NULL, 1, 'fake-hardware', 'deploy', 'fake', 0)
INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.017763', NULL, NULL, 1, 'fake-hardware', 'console', 'no-console', 1)
INSERT INTO conductor_hardware_interfaces (created_at, updated_at, version, conductor_id, hardware_type, interface_type, interface_name, `default`) VALUES ('2020-01-07 16:38:45.023438', NULL, NULL, 1, 'fake-hardware', 'inspect', 'fake', 0)

which I think correspond to
https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/base_manager.py#L149 and
https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/base_manager.py#L165

I believe that even though the conductor is registered with the 3 hardware_types, some don't get returned until each one is added to conductor_hardware_interfaces.

The likelihood of hitting this race condition is slim, but we've also observed that the DB operations (sync) on the environment in question are very slow, which is probably increasing the window in which we can hit the problem. Ultimately I think we have two problems here:

  1. we can't assume that just because /v1/drivers is returning items, all the drivers have loaded
  2. we have a slow DB (which exposed problem 1)

> Possible workaround in the image ref metal3-io/ironic-image#122

I'm not sure this will work, as it checks the conductors table, and the drivers will not yet have been added to the conductor_hardware_interfaces table.
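
For illustration, a sketch of the kind of stronger check this implies (not the actual ironic-image#122 change; the DSN, the MySQL driver import, and picking ipmi as the driver of interest are assumptions of the sketch):

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// driverInterfacesRegistered reports whether at least one interface row
// exists in conductor_hardware_interfaces for the given hardware type,
// which only happens after the conductor has finished loading that driver.
func driverInterfacesRegistered(dsn, hardwareType string) (bool, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return false, err
	}
	defer db.Close()

	var n int
	err = db.QueryRow(
		`SELECT COUNT(*) FROM conductor_hardware_interfaces WHERE hardware_type = ?`,
		hardwareType).Scan(&n)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	// DSN is a made-up example; the real credentials live in the ironic config.
	ok, err := driverInterfacesRegistered("ironic:password@tcp(localhost:3306)/ironic", "ipmi")
	fmt.Println(ok, err)
}
```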

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label Apr 8, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 8, 2020

stbenjam commented May 8, 2020

/lifecycle frozen

openshift-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label May 8, 2020

stbenjam commented May 8, 2020

This is an important issue that needs fixing, even if it's not something we see very often.


jparrill commented Aug 25, 2020

Facing this one on the OCP 4.5.6 stable branch under IPv6:

[screenshot attached]

Any workaround?

@andfasano

A fix was available at openshift/baremetal-operator@d6eaf67

@jparrill

Verified that this is solved on the OCP 4.5.7 stable branch.


stbenjam commented Nov 2, 2020

Closing as this has been reported as fixed.

/close

@openshift-ci-robot

@stbenjam: Closing this issue.

In response to this:

> Closing as this has been reported as fixed.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
