OCPBUGS-11936: Wait for OpenStack Metadata service #4092

stephenfin · 2024-01-03T11:20:07Z

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn [1] does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by restarting the service on failure. This has been possible since systemd v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754
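
As a rough illustration of the restart-on-failure behaviour this change relies on, here is a minimal Python sketch that retries a failing one-shot command with a fixed delay. It only models the idea; systemd does the real work, and the names `run_with_restarts`, `max_starts` and `delay` are hypothetical and do not appear anywhere in this change.

```python
import time
from typing import Callable


def run_with_restarts(
    attempt: Callable[[], bool],
    max_starts: int,
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `attempt` until it succeeds or `max_starts` is exhausted.

    Returns the number of starts used on success; raises RuntimeError otherwise.
    """
    for start in range(1, max_starts + 1):
        if attempt():
            return start
        if start < max_starts:
            sleep(delay)  # wait before the next start
    raise RuntimeError(f"gave up after {max_starts} starts")


# Example: a fake metadata lookup that only succeeds on the third start.
# The sleep function is replaced with a recorder so nothing actually waits.
outcomes = iter([False, False, True])
waits = []
starts_used = run_with_restarts(
    lambda: next(outcomes), max_starts=120, delay=5, sleep=waits.append
)
```

In this toy run the lookup succeeds on the third start after two waits. The real service simply exits non-zero and lets systemd schedule the next start.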

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now retry if the metadata service takes longer than expected to come up.

openshift-ci-robot · 2024-01-03T11:20:13Z

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by adding a check for the specific IP address. This will run for up to 5 minutes before expiring.

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now wait longer for the metadata service to come up.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

stephenfin · 2024-01-03T11:21:38Z

/jira refresh

openshift-ci-robot · 2024-01-03T11:21:44Z

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.

In response to this:

/jira refresh

stephenfin · 2024-01-22T11:47:35Z

/retest-required

stephenfin · 2024-01-29T14:05:14Z

/retest-required

templates/common/openstack/units/afterburn-hostname.service.yaml

openshift-ci-robot · 2024-02-12T20:16:28Z

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.

In response to this:

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn [1] does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by restarting the service on failure. This has been possible since systemd v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now wait longer for the metadata service to come up.

mdbooth · 2024-02-13T12:09:52Z

Fortunately I just tested it, because TimeoutStartSec doesn't work that way: it applies to a single startup, not a succession of startups, even when using RestartMode=direct. This is what I think we want:

  [Unit]
  Description=Afterburn Hostname
  # Block services relying on Networking being up.
  Before=network-online.target
  # Wait for NetworkManager to report it's online
  After=NetworkManager-wait-online.service
  # Run before hostname checks
  Before=node-valid-hostname.service

  # Allow 120 retries (10 minutes at 5 seconds per retry)
  StartLimitIntervalSec=infinity
  StartLimitBurst=120

  [Service]
  Type=oneshot
  RemainAfterExit=true

  # Retry every 5 seconds on failure without marking the service as failed
  Restart=on-failure
  RestartMode=direct
  RestartSec=5

  ExecStart=/usr/local/bin/openstack-afterburn-hostname

  [Install]
  RequiredBy=network-online.target

Note that the above upgrades WantedBy to RequiredBy. I'm still not 100% on that, though. Perhaps we should leave that to a separate change.

The 10 minute timeout seems excessive, but unfortunately I think this is in the order of what's required on PSI.
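
For reference, the retry window quoted in the unit comment can be checked with a little arithmetic. This is only a sketch in Python, assuming every start fails; `attempt_seconds` is a hypothetical allowance for how long each failed attempt itself runs, which the 10 minute figure leaves out.

```python
from datetime import timedelta

START_LIMIT_BURST = 120  # StartLimitBurst from the unit above
RESTART_SEC = 5          # RestartSec from the unit above


def retry_window(burst: int, restart_sec: float, attempt_seconds: float = 0.0) -> timedelta:
    """Approximate time before the start limit is hit, assuming every start fails."""
    return timedelta(seconds=burst * (restart_sec + attempt_seconds))


window = retry_window(START_LIMIT_BURST, RESTART_SEC)
print(window)  # 0:10:00
```

If each failed attempt also takes a second or so before exiting, the effective window grows accordingly, so 10 minutes is best read as a lower bound.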

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn [1] does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by restarting the service on failure. This has been possible since systemd v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>

stephenfin · 2024-02-14T10:09:54Z

/retest-required

stephenfin · 2024-02-14T10:10:12Z

/test e2e-openstack

openshift-ci · 2024-02-14T12:17:16Z

@stephenfin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `3bc24f6` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | `3bc24f6` | link | false | `/test e2e-azure-ovn-upgrade-out-of-change` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

stephenfin · 2024-03-19T16:40:39Z

/test e2e-openstack

mdbooth · 2024-04-30T12:01:25Z

/lgtm

openshift-ci · 2024-04-30T12:09:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, stephenfin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~templates/common/openstack/OWNERS~~ [mdbooth,stephenfin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-04-30T15:13:31Z

@stephenfin: Jira Issue OCPBUGS-11936: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4092

Jira Issue OCPBUGS-11936 has been moved to the MODIFIED state.

openshift-bot · 2024-04-30T22:46:15Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202404302014.p0.g69944ad.assembly.stream.el9 for distgit ose-machine-config-operator.
All builds following this will include this PR.

openshift-merge-robot · 2024-05-01T19:43:00Z

Fix included in accepted release 4.16.0-0.nightly-2024-05-01-111315

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) Jan 3, 2024

openshift-ci bot requested review from mdbooth and pierreprinetti January 3, 2024 11:23

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 3, 2024

mdbooth reviewed Jan 30, 2024

templates/common/openstack/units/afterburn-hostname.service.yaml (outdated)

stephenfin force-pushed the OCPBUGS-11936 branch from f59b3e6 to 2e23f00 Compare February 12, 2024 20:14

stephenfin force-pushed the OCPBUGS-11936 branch from 2e23f00 to 3bc24f6 Compare February 13, 2024 12:22

openshift-ci bot assigned mdbooth Apr 30, 2024

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Apr 30, 2024

openshift-merge-bot bot merged commit 69944ad into openshift:master Apr 30, 2024
16 of 17 checks passed

pierreprinetti deleted the OCPBUGS-11936 branch April 30, 2024 15:57
