OCPBUGS-11936: Wait for OpenStack Metadata service #4092

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

stephenfin
Contributor

@stephenfin stephenfin commented Jan 3, 2024

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn [1] does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by restarting the service on failure. This is possible since systemd v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754
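
For illustration only, the retry-on-failure idea can be sketched in Python. This is not the change made in this PR, which relies on systemd restarting the unit; `wait_for_metadata` and `probe` are hypothetical names, and the attempt count and delay are placeholders chosen by the caller.

```python
import time
from typing import Callable, Optional


def wait_for_metadata(
    probe: Callable[[], Optional[str]],
    max_attempts: int,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call probe() until it returns a hostname or the attempts run out.

    probe is a hypothetical stand-in for whatever queries the metadata
    service; it returns None or raises OSError while the service is down.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            hostname = probe()
        except OSError:
            # Treat connection errors like "not ready yet".
            hostname = None
        if hostname:
            return hostname
        if attempt < max_attempts:
            # Fixed pause between attempts, similar in spirit to RestartSec=.
            sleep(delay_seconds)
    raise TimeoutError(
        f"metadata service not reachable after {max_attempts} attempts"
    )


# Usage with a fake probe that succeeds on the third call.
responses = iter([None, None, "worker-0"])
print(wait_for_metadata(lambda: next(responses), max_attempts=5, delay_seconds=0.0))
```

Having a routable IP address only means `probe` can be attempted; it says nothing about whether the call succeeds, which is why a retry is needed at all.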

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now retry if the metadata service takes longer than expected to come up.

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 3, 2024
@openshift-ci-robot
Contributor

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. afterburn does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by adding a check for the specific IP address. This will run for up to 5 minutes before expiring.

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now wait longer for the metadata service to come up.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@stephenfin
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 3, 2024
@openshift-ci-robot
Contributor

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.

In response to this:

/jira refresh

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 3, 2024
@stephenfin
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

@stephenfin: This pull request references Jira Issue OCPBUGS-11936, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.

In response to this:

- What I did

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be a delay before the metadata service is available. Afterburn [1] does not provide a mechanism to configure the timeouts, and the network-online target merely indicates that we have a configured, routable IP address, not necessarily that we can access the metadata service yet. Work around this by restarting the service on failure. This is possible since systemd v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754

- How to verify it

Working on this with QE now.

- Description for the changelog

The afterburn-hostname.service systemd service deployed on OpenStack will now wait longer for the metadata service to come up.

@mdbooth
Contributor

mdbooth commented Feb 13, 2024

Fortunately I just tested it, because TimeoutStartSec doesn't work that way: it applies to a single startup, not a succession of startups, even when using RestartMode=direct. This is what I think we want:

  [Unit]
  Description=Afterburn Hostname
  # Block services relying on Networking being up.
  Before=network-online.target
  # Wait for NetworkManager to report it's online
  After=NetworkManager-wait-online.service
  # Run before hostname checks
  Before=node-valid-hostname.service

  # Allow 120 retries (10 minutes at 5 seconds per retry)
  StartLimitIntervalSec=infinity
  StartLimitBurst=120

  [Service]
  Type=oneshot
  RemainAfterExit=true

  # Retry every 5 seconds on failure without marking the service as failed
  Restart=on-failure
  RestartMode=direct
  RestartSec=5

  ExecStart=/usr/local/bin/openstack-afterburn-hostname

  [Install]
  RequiredBy=network-online.target

Note that the above upgrades WantedBy to RequiredBy. I'm still not 100% on that, though. Perhaps we should leave that to a separate change.

The 10 minute timeout seems excessive, but unfortunately I think this is in the order of what's required on PSI.
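
As a rough check on the numbers above, the retry window can be worked out with a few lines of Python. This is a back-of-the-envelope sketch that follows the arithmetic in the unit file comment; `retry_window_seconds` and `attempt_runtime` are hypothetical names, and the 1 second runtime used below is an assumed figure, not a measurement.

```python
def retry_window_seconds(
    start_limit_burst: int,
    restart_sec: float,
    attempt_runtime: float = 0.0,
) -> float:
    """Approximate time until the start limit is exhausted.

    Each of the start_limit_burst attempts is assumed to run for
    attempt_runtime seconds and to be followed by a restart_sec pause.
    """
    return start_limit_burst * (restart_sec + attempt_runtime)


# StartLimitBurst=120 and RestartSec=5 from the unit above, in minutes.
print(retry_window_seconds(120, 5) / 60)  # prints 10.0

# If every attempt also spent an assumed 1 second failing, the window grows.
print(retry_window_seconds(120, 5, attempt_runtime=1.0) / 60)  # prints 12.0
```

Since TimeoutStartSec applies to a single startup, it would only cap `attempt_runtime` here; the overall window is governed by StartLimitBurst and RestartSec.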

When deploying a cluster on OpenStack using OVN-Kubernetes, there can be
a delay before the metadata service is available. Afterburn [1] does not
provide a mechanism to configure the timeouts, and the network-online
target merely indicates that we have *a* configured, routable IP address,
not necessarily that we can access the metadata service yet. Work around
this by restarting the service on failure. This is possible since systemd
v244.1 [2].

[1] https://github.com/coreos/afterburn
[2] systemd/systemd#13754

Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
@stephenfin
Contributor Author

/retest-required

@stephenfin
Contributor Author

/test e2e-openstack

Contributor

openshift-ci bot commented Feb 14, 2024

@stephenfin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 3bc24f6 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 3bc24f6 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stephenfin
Contributor Author

/test e2e-openstack

@mdbooth
Contributor

mdbooth commented Apr 30, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2024
Contributor

openshift-ci bot commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdbooth, stephenfin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 69944ad into openshift:master Apr 30, 2024
16 of 17 checks passed
@openshift-ci-robot
Contributor

@stephenfin: Jira Issue OCPBUGS-11936: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11936 has been moved to the MODIFIED state.

@pierreprinetti pierreprinetti deleted the OCPBUGS-11936 branch April 30, 2024 15:57
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202404302014.p0.g69944ad.assembly.stream.el9 for distgit ose-machine-config-operator.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-01-111315

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
5 participants