Skip to content

USHIFT-1972: Fix service restart race condition for services that exit#2695

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
ggiguash:fix_greenboot_race
Dec 12, 2023
Merged

USHIFT-1972: Fix service restart race condition for services that exit#2695
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
ggiguash:fix_greenboot_race

Conversation

@ggiguash
Copy link
Contributor

@ggiguash ggiguash commented Dec 7, 2023

When services are started synchronously using systemctl start or systemctl restart, they go through start and then running sub-state.

The race condition was on services that do not have a "long" running phase, i.e. greenboot. Those services run until completion and exit. So, there is a start, very short running and exited sub-states for those services.

To fix the race condition with minimal changes to the existing infrastructure, a special systemctl restart logic was introduced for the greenboot-healthcheck.service.

In addition, the fixes include:

  • Remove the Systemctl With Retry keyword
  • Refactor the Systemctl keyword to shorten its length and address linter warning

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 7, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 7, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.15.0" version, but no target version was set.

Details

In response to this:

  • Servicectl keyword now uses the --no-block option to allow immediate return in case of start and restart commands. This allows for proper service status check in case of greenboot, etc. that can terminate and not be running.
  • Remove Systemctl With Retry keyword
  • Fix setting of local variables to use Set Local Variable keyword and not Set Variable. This ensures the scope of variables remain local and not per-suite.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from pliurh and pmtk December 7, 2023 07:11
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 7, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.15.0" version, but no target version was set.

Details

In response to this:

  • Servicectl keyword now uses the --no-block option to allow immediate return in case of start and restart commands. This allows for proper service status check in case of greenboot, etc. that can terminate and not be running.
  • Remove Systemctl With Retry keyword
  • Fix setting of local variables to use the explicit Set Local Variable keyword and not Set Variable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 7, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.15.0" version, but no target version was set.

Details

In response to this:

  • Servicectl keyword now uses the --no-block option to allow immediate return in case of start and restart commands. This allows for proper service status check in case of greenboot, etc. that can terminate and not be running.
  • Remove Systemctl With Retry keyword

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 7, 2023

/test microshift-metal-tests microshift-metal-tests-arm

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 7, 2023

/test metal-periodic-test

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 8, 2023

/test metal-periodic-test

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 8, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

Details

In response to this:

When services are started synchronously using systemctl start or systemctl restart, they go through start and then running sub-state.

The race condition was on services that do not have a "long" running phase, i.e. greenboot. Those services run until completion and exit. So, there is a start, very short running and exited sub-states for those services.

To fix the race condition, services can be started and restarted using --no-block option. This allows for a quick return from the systemctl command and then status check for the start sub-state. If the service fails later, this needs to be checked using other service-specific checks, which are performed in the tests anyway.

The fixes include:

  • Servicectl keyword now uses the --no-block option to allow immediate return in case of start and restart commands. This allows for proper service status check in case of greenboot, etc. that can terminate and not be running.
  • Remove Systemctl With Retry keyword

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 8, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

Details

In response to this:

When services are started synchronously using systemctl start or systemctl restart, they go through start and then running sub-state.

The race condition was on services that do not have a "long" running phase, i.e. greenboot. Those services run until completion and exit. So, there is a start, very short running and exited sub-states for those services.

To fix the race condition, services can be started and restarted using --no-block option. This allows for a quick return from the systemctl command and then status check for the start or running sub-states. If the service fails later, this needs to be checked using other service-specific checks, which are performed in the tests anyway.

The fixes include:

  • Servicectl keyword now uses the --no-block option to allow immediate return in case of start and restart commands. This allows for proper service status check in case of greenboot, etc. that can terminate and not be running.
  • Remove Systemctl With Retry keyword

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 8, 2023

/test metal-periodic-test

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 8, 2023

/test metal-periodic-test

@ggiguash
Copy link
Contributor Author

ggiguash commented Dec 8, 2023

/test metal-periodic-test

@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 10, 2023

@ggiguash: This pull request references USHIFT-1972 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

Details

In response to this:

When services are started synchronously using systemctl start or systemctl restart, they go through start and then running sub-state.

The race condition was on services that do not have a "long" running phase, i.e. greenboot. Those services run until completion and exit. So, there is a start, very short running and exited sub-states for those services.

To fix the race condition with minimal changes to the existing infrastructure, a special systemctl restart logic was introduced for the greenboot-healthcheck.service.

In addition, the fixes include:

  • Remove the Systemctl With Retry keyword
  • Refactor the Systemctl keyword to shorten its length and address linter warning

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ggiguash
Copy link
Contributor Author

/test metal-periodic-test

@ggiguash
Copy link
Contributor Author

/test microshift-metal-tests microshift-metal-tests-arm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 11, 2023

@ggiguash: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/metal-periodic-test-arm e0eb6934eb973d4b32c2324ed263a1ac180e6f4f link true /test metal-periodic-test-arm

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pmtk
Copy link
Member

pmtk commented Dec 12, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 12, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 70a7a2b and 2 for PR HEAD 74a589a in total

@openshift-merge-bot openshift-merge-bot bot merged commit bb0ccaa into openshift:main Dec 12, 2023
@ggiguash
Copy link
Contributor Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@ggiguash: new pull request created: #2734

Details

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ggiguash ggiguash deleted the fix_greenboot_race branch December 14, 2023 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants