[release-4.13] OCPBUGS-19894: remove prestop hooks for northd, sbdbd and nbdb #2001

Merged
merged 1 commit into from Oct 6, 2023

jluhrsen
Contributor

When the preStop hook executes the ovn-ctl stop... command, it causes the container's main script to exit. That exit propagates to the container's main process (PID 1), which exits and takes the container down with it. This is not necessarily a problem on its own, but with recent changes in CRI-O [0] there is a problem with how kubelet views this case: the containers are Exited from CRI-O's perspective but still Ready from kubelet's. Because of this, the pod gets stuck in the Terminating state until it times out (~3m), after which the containers restart.
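A minimal shell simulation of the failure mode described above, under stated assumptions: `sleep` stands in for ovn-northd, and `fake_prestop` is a hypothetical stand-in for the hook's ovn-ctl stop... call, which stops the daemon directly instead of signalling the main script. The main script's `wait` returns as soon as the daemon is gone, so the script (PID 1 in a real container) exits on its own and its SIGTERM trap never gets a chance to run.

```shell
#!/usr/bin/env bash
# Hypothetical simulation of the old preStop path; not the real ovn-ctl logic.
# `sleep` stands in for ovn-northd and `fake_prestop` for the hook's stop call.

main_script() {
  # Graceful path; it never runs here because no SIGTERM reaches this script.
  trap 'echo "trap ran"; exit 0' TERM
  sleep 300 &                 # stand-in for the long-running daemon
  echo "$!" > "$1"            # record the daemon pid (like ovn-northd.pid)
  wait "$!"                   # blocks until the daemon goes away
  echo "daemon exited; main script exits too"
}

fake_prestop() {
  # Stop the daemon directly via its pid file, bypassing the main script.
  kill "$(cat "$1")"
}

pidfile=$(mktemp)
logfile=$(mktemp)
main_script "$pidfile" > "$logfile" 2>&1 &
main_pid=$!

until [ -s "$pidfile" ]; do sleep 0.05; done   # wait for the daemon to start
fake_prestop "$pidfile"
wait "$main_pid"              # the "container" has exited before any SIGTERM
cat "$logfile"
rm -f "$pidfile"
```

In a real pod, the runtime would then be asked to stop a container whose main process is already gone, which is the state the description says CRI-O and kubelet disagree about.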

Removing the preStop hook lets the SIGTERM reach the configured trap (example [1]), which performs effectively the same shutdown the preStop hook would have. This prevents the pod from getting stuck in the Terminating state.

[0] cri-o/cri-o#7168
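[1] https://github.com/openshift/cluster-network-operator/blob/c55f19132864a9d8fe347ad6a8280eaa2f60e839/bindata/network/ovn-kubernetes/self-hosted/single-zone-interconnect/ovnkube-master.yaml#L74-L82

quit() {
  echo "$(date -Iseconds) - stopping ovn-northd"
  OVN_MANAGE_OVSDB=no /usr/share/ovn/scripts/ovn-ctl stop_northd
  echo "$(date -Iseconds) - ovn-northd stopped"
  rm -f /var/run/ovn/ovn-northd.pid
  exit 0
}
# end of quit
trap quit TERM INT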
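A companion sketch of the shutdown path this PR relies on, assuming `sleep` stands in for ovn-northd and a temporary directory stands in for /var/run/ovn. The `entrypoint` and `quit` functions here are illustrative only, not the script from [1]: SIGTERM goes to the main script, the trap stops the daemon, removes its pid file, and exits 0.

```shell
#!/usr/bin/env bash
# Sketch of the SIGTERM -> trap shutdown path; all names are illustrative.
# `sleep` stands in for ovn-northd, and $rundir stands in for /var/run/ovn.

rundir=$(mktemp -d)

entrypoint() {
  quit() {
    echo "stopping daemon"
    kill "$daemon_pid"          # stand-in for `ovn-ctl stop_northd`
    wait "$daemon_pid"          # reap it so shutdown is complete
    rm -f "$rundir/daemon.pid"  # like removing ovn-northd.pid
    echo "daemon stopped"
    exit 0
  }
  trap quit TERM INT

  sleep 300 &                   # run the daemon in the background...
  daemon_pid=$!
  echo "$daemon_pid" > "$rundir/daemon.pid"
  wait "$daemon_pid"            # ...so a trapped SIGTERM interrupts this wait
}

entrypoint > "$rundir/log" 2>&1 &
main_pid=$!

until [ -s "$rundir/daemon.pid" ]; do sleep 0.05; done
kill -TERM "$main_pid"          # SIGTERM, as the runtime sends at pod termination
wait "$main_pid"
status=$?
echo "main script exit status: $status"
cat "$rundir/log"
```

The daemon is started with `&` and the script blocks in `wait` because the shell defers a trap while a foreground command runs, whereas the `wait` builtin returns immediately (status above 128) when a trapped signal arrives and the trap then runs. With a foreground daemon, the trap would only fire after the daemon exited by itself.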

When the preStop hook executes the ovn-ctl stop... command, it causes
the container's main script to exit. That exit propagates to the
container's main process (PID 1), which exits and takes the container
down with it. This is not necessarily a problem on its own, but with
recent changes in CRI-O [0] there is a problem with how kubelet views
this case: the containers are Exited from CRI-O's perspective but
still Ready from kubelet's. Because of this, the pod gets stuck in the
Terminating state until it times out (~3m), after which the
containers restart.

Removing the preStop hook lets the SIGTERM reach the configured trap
(example [1]), which performs effectively the same shutdown the
preStop hook would have. This prevents the pod from getting stuck in
the Terminating state.

[0] cri-o/cri-o#7168
[1] https://github.com/openshift/cluster-network-operator/blob/c55f19132864a9d8fe347ad6a8280eaa2f60e839/bindata/network/ovn-kubernetes/self-hosted/single-zone-interconnect/ovnkube-master.yaml#L74-L82

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
@jluhrsen changed the title from "JIRA: https://issues.redhat.com/browse/OCPBUGS-17391" to "OCPBUGS-17391: remove prestop hooks for northd, sbdbd and nbdb" on Sep 19, 2023
@openshift-ci-robot added the labels jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Sep 19, 2023
@openshift-ci-robot
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-17391, which is invalid:

  • expected the bug to target the "4.13.z" version, but it targets "4.14.0" instead
  • expected Jira Issue OCPBUGS-17391 to depend on a bug targeting a version in 4.14.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcbw changed the title from "OCPBUGS-17391: remove prestop hooks for northd, sbdbd and nbdb" to "[release-4.13] OCPBUGS-17391: remove prestop hooks for northd, sbdbd and nbdb" on Sep 28, 2023
@dcbw changed the title from "[release-4.13] OCPBUGS-17391: remove prestop hooks for northd, sbdbd and nbdb" to "[release-4.13] OCPBUGS-19894: remove prestop hooks for northd, sbdbd and nbdb" on Sep 28, 2023
@openshift-ci-robot
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-19894, which is invalid:

  • expected dependent Jira Issue OCPBUGS-19808 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@jluhrsen
Contributor Author

jluhrsen commented Oct 2, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 2, 2023

/jira refresh

@openshift-ci-robot
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-19894, which is invalid:

  • expected dependent Jira Issue OCPBUGS-19808 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@openshift-merge-robot
Contributor

@jluhrsen: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jluhrsen
Contributor Author

jluhrsen commented Oct 3, 2023

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Oct 3, 2023
@openshift-ci-robot
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-19894, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.z) matches configured target version for branch (4.13.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-19808 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-19808 targets the "4.14.0" version, which is one of the valid target versions: 4.14.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

/jira refresh

@openshift-ci-robot removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Oct 3, 2023
@openshift-ci-robot
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-19894, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.z) matches configured target version for branch (4.13.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-19808 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-19808 targets the "4.14.0" version, which is one of the valid target versions: 4.14.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

@jcaamano
Contributor

jcaamano commented Oct 3, 2023

/lgtm
/approve
/label backport-risk-assessed

@openshift-ci bot added the backport-risk-assessed label (indicates a PR to a release branch has been evaluated and considered safe to accept) on Oct 3, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcaamano, jluhrsen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 3, 2023
@asood-rh

asood-rh commented Oct 3, 2023

/label cherry-pick-approved

@openshift-ci bot added the cherry-pick-approved label (indicates a cherry-pick PR into a release branch has been approved by the release branch manager) on Oct 3, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 777eb7b and 2 for PR HEAD 55ae57d in total

@jluhrsen
Contributor Author

jluhrsen commented Oct 4, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 4, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 4, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 4, 2023

/retest

@flavio-fernandes
Contributor

/retest-required

@jluhrsen
Contributor Author

jluhrsen commented Oct 5, 2023

/retest

@flavio-fernandes
Contributor

/retest-required

@jluhrsen
Contributor Author

jluhrsen commented Oct 6, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 6, 2023

/retest

@jluhrsen
Contributor Author

jluhrsen commented Oct 6, 2023

/retest

@flavio-fernandes
Contributor

/retest-required

@sdodson
Member

sdodson commented Oct 6, 2023

/override ci/prow/e2e-metal-ipi-ovn-ipv6
This job is known to be broken at the moment

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@sdodson: sdodson unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight.

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-ipv6
This job is known to be broken at the moment

@dcbw
Contributor

dcbw commented Oct 6, 2023

/override ci/prow/e2e-metal-ipi-ovn-ipv6

@dougbtv
Member

dougbtv commented Oct 6, 2023

/override ci/prow/e2e-metal-ipi-ovn-ipv6

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-metal-ipi-ovn-ipv6

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-ipv6

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@dougbtv: Overrode contexts on behalf of dougbtv: ci/prow/e2e-metal-ipi-ovn-ipv6

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-ipv6

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@jluhrsen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 55ae57d | link | true | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.

@openshift-ci bot merged commit 961ca3d into openshift:release-4.13 on Oct 6, 2023
31 checks passed
@openshift-ci-robot
Contributor

@jluhrsen: Jira Issue OCPBUGS-19894: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19894 has been moved to the MODIFIED state.

@jluhrsen
Contributor Author

jluhrsen commented Oct 6, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@jluhrsen: new pull request created: #2055

In response to this:

/cherry-pick release-4.12

@openshift-merge-robot
Contributor

Fix included in accepted release 4.13.0-0.nightly-2023-10-07-050717

@jluhrsen deleted the no-prestop-hook-4.13 branch on April 19, 2024 22:11
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.