OCPBUGS-29397: 4.14 High CPU usage with APB CRD #2118

jordigilh
Contributor

@jordigilh jordigilh commented Apr 11, 2024

Fixes an issue when using APB in a cluster with a large number of pods: the APB controller called the KAPI server on every pod event to update the APB status, causing high CPU usage.
The fix checks whether the last message in the APB status held in the informer matches the message generated for the pod event. Only if the message differs, or the status (succeeded or failed) differs, does the controller request the latest copy of the APB CR from the KAPI server and update it.
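
For readers who want the idea in code form, here is a minimal sketch of the check described above. It is not the code from this PR: `Status`, `needsUpdate`, `fakeAPI`, and `handlePodEvent` are hypothetical names, the status strings are made up, and the API server is replaced by an in-memory counter so the effect on call volume is visible.

```go
package main

import "fmt"

// Status is a hypothetical, simplified stand-in for the APB CR status
// as seen in the informer cache.
type Status struct {
	State    string   // illustrative values: "Success" or "Fail"
	Messages []string // most recent message is last
}

// needsUpdate reports whether a pod event should trigger a write.
// It returns false only when the cached state and the last cached
// message both match what the event would produce.
func needsUpdate(cached Status, state, msg string) bool {
	if cached.State != state || len(cached.Messages) == 0 {
		return true
	}
	return cached.Messages[len(cached.Messages)-1] != msg
}

// fakeAPI stands in for the API server and counts round trips.
type fakeAPI struct {
	calls  int
	stored Status
}

// getAndUpdate simulates fetching the latest CR and updating its status.
func (f *fakeAPI) getAndUpdate(state, msg string) {
	f.calls++
	f.stored.State = state
	f.stored.Messages = append(f.stored.Messages, msg)
}

// handlePodEvent skips the API call when the cached status already
// reflects the event. Copying the stored status back into the cache
// imitates the informer refreshing after the update.
func handlePodEvent(api *fakeAPI, cached *Status, state, msg string) {
	if !needsUpdate(*cached, state, msg) {
		return
	}
	api.getAndUpdate(state, msg)
	cached.State = api.stored.State
	cached.Messages = append([]string(nil), api.stored.Messages...)
}

func main() {
	api := &fakeAPI{}
	cached := &Status{}
	// 1000 pod events that all produce the same status message.
	for i := 0; i < 1000; i++ {
		handlePodEvent(api, cached, "Success", "configured external gateways")
	}
	// One event that changes the outcome.
	handlePodEvent(api, cached, "Fail", "pod not found")
	fmt.Println("API calls:", api.calls) // prints "API calls: 2"
}
```

In this sketch, 1001 pod events result in two simulated API calls: one for the first event and one when the status changes.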

Signed-off-by: Jordi Gil <jgil@redhat.com>
@jordigilh jordigilh requested a review from dcbw as a code owner April 11, 2024 15:56
@jordigilh jordigilh marked this pull request as draft April 11, 2024 15:56
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2024
@jordigilh jordigilh marked this pull request as ready for review April 11, 2024 23:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2024
@openshift-ci openshift-ci bot requested review from abhat and tssurya April 11, 2024 23:44
@jordigilh
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 11, 2024
@jordigilh jordigilh force-pushed the fix/disable_apb_status_update branch from 8e2a60d to 3c91d65 Compare April 12, 2024 14:28
…e in status to avoid hitting the KAPI server for any pod status change

Signed-off-by: Jordi Gil <jgil@redhat.com>
@jordigilh jordigilh force-pushed the fix/disable_apb_status_update branch from 3c91d65 to b9eeed6 Compare April 12, 2024 15:49
@jordigilh jordigilh changed the title [DRAFT] [OCPBUGS-29397] 4.14 High CPU usage with APB CRD [OCPBUGS-29397] 4.14 High CPU usage with APB CRD Apr 24, 2024
…g initialized to avoid the risk of time being initialized and slice not having any element

Signed-off-by: jordigilh <jgil@redhat.com>
@jordigilh
Contributor Author

/retest-required

@jordigilh
Contributor Author

/retest-required

Signed-off-by: jordigilh <jgil@redhat.com>
@jordigilh jordigilh force-pushed the fix/disable_apb_status_update branch from f6144fa to b8d9ebe Compare April 26, 2024 23:09
@jordigilh
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2024
Contributor

openshift-ci bot commented Apr 27, 2024

@jordigilh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | b8d9ebe | link | false | `/test security` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@npinaeva
Member

/lgtm

@trozet
Contributor

trozet commented Apr 30, 2024

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Apr 30, 2024
Contributor

@trozet trozet left a comment

/lgtm

@trozet
Contributor

trozet commented Apr 30, 2024

/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2024
@trozet
Contributor

trozet commented Apr 30, 2024

/retitle OCPBUGS-29397: 4.14 High CPU usage with APB CRD

@openshift-ci openshift-ci bot changed the title [OCPBUGS-29397] 4.14 High CPU usage with APB CRD OCPBUGS-29397: 4.14 High CPU usage with APB CRD Apr 30, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 30, 2024
@openshift-ci-robot
Contributor

@jordigilh: This pull request references Jira Issue OCPBUGS-29397, which is invalid:

  • expected the bug to target the "4.14.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected Jira Issue OCPBUGS-29397 to depend on a bug targeting a version in 4.15.0, 4.15.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jordigilh, npinaeva, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2024
@jechen0648

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Apr 30, 2024
@asood-rh

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 30, 2024
@josecastillolema
Contributor

From a perf/scale perspective the fix was successfully validated on the ScaleLab at a 120 node scale.

@jordigilh
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

@jordigilh: This pull request references Jira Issue OCPBUGS-29397, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected Jira Issue OCPBUGS-29397 to depend on a bug targeting a version in 4.15.0, 4.15.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@jordigilh
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

@jordigilh: This pull request references Jira Issue OCPBUGS-29397, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected Jira Issue OCPBUGS-29397 to depend on a bug targeting a version in 4.15.0, 4.15.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@jordigilh
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

@jordigilh: This pull request references Jira Issue OCPBUGS-29397, which is invalid:

  • expected dependent Jira Issue OCPBUGS-33213 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is Closed (Won't Do) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@jordigilh
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 2, 2024
@openshift-ci-robot
Contributor

@jordigilh: This pull request references Jira Issue OCPBUGS-29397, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.z) matches configured target version for branch (4.14.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-33213 is in the state Closed (Done), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-33213 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0, 4.15.z
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (dwilson@redhat.com), skipping review request.

In response to this:

/jira refresh

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 2, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit cc219eb into openshift:release-4.14 May 2, 2024
26 of 27 checks passed
@openshift-ci-robot
Contributor

@jordigilh: Jira Issue OCPBUGS-29397: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-29397 has been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-ovn-kubernetes-base-container-v4.14.0-202405021608.p0.gcc219eb.assembly.stream.el9 for distgit ovn-kubernetes-base.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.14.0-0.nightly-2024-05-02-211455

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR