OCPBUGS-18054: Emit node events only when retry failure #1837


martinkennelly
Contributor

The Node object is configured by distributed software components. Before this patch, we sent numerous Kubernetes events of type Warning when in fact everything was proceeding normally.

Only emit Warning events when we fail to configure a node, that is, after 15 retry attempts (currently ~7m).

We continue to log every node add/update/delete failure.

Signed-off-by: Martin Kennelly <mkennell@redhat.com>
(cherry picked from commit 8889f47)
(cherry picked from commit dada90d)
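
As a rough illustration of the behaviour described above (not the code from this patch), the sketch below logs every failure but emits a single Warning only once the retry budget is used up. The names `eventRecorder`, `nodeRetrier`, `handleResult`, `errNotReady`, and the reason string are hypothetical, and the threshold of 15 is taken from the description; whether the real implementation counts the final attempt the same way is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// maxFailedAttempts mirrors the "15 retry attempts" from the PR description.
const maxFailedAttempts = 15

// errNotReady is a hypothetical transient error used for the demo.
var errNotReady = errors.New("node not fully configured yet")

// eventRecorder is a hypothetical stand-in for a Kubernetes event recorder.
type eventRecorder interface {
	Warning(node, reason, message string)
}

// nodeRetrier counts consecutive failures per node (hypothetical helper).
type nodeRetrier struct {
	failures map[string]int
	recorder eventRecorder
}

func newNodeRetrier(r eventRecorder) *nodeRetrier {
	return &nodeRetrier{failures: map[string]int{}, recorder: r}
}

// handleResult logs every failure, but emits a Warning event only when the
// retry budget is exhausted. It reports whether the caller should retry.
func (n *nodeRetrier) handleResult(node string, err error) bool {
	if err == nil {
		delete(n.failures, node) // success resets the counter
		return false
	}
	n.failures[node]++
	fmt.Printf("node %s: attempt %d failed: %v\n", node, n.failures[node], err)
	if n.failures[node] < maxFailedAttempts {
		return true // still within budget: log only, no event
	}
	// Budget exhausted: this is the only place a Warning is emitted.
	n.recorder.Warning(node, "NodeConfigFailed", err.Error())
	delete(n.failures, node)
	return false
}

// countingRecorder counts emitted warnings so the demo can inspect them.
type countingRecorder struct{ warnings int }

func (c *countingRecorder) Warning(node, reason, message string) { c.warnings++ }

func main() {
	rec := &countingRecorder{}
	r := newNodeRetrier(rec)
	for i := 0; i < maxFailedAttempts; i++ {
		r.handleResult("node-a", errNotReady)
	}
	fmt.Println("warnings emitted:", rec.warnings)
}
```

If the retry interval were about 30 seconds (an assumption, the description does not state it), 15 attempts would take roughly the ~7 minutes quoted above.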

@openshift-ci-robot added labels Aug 25, 2023:
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@martinkennelly: This pull request references Jira Issue OCPBUGS-18054, which is invalid:

  • expected Jira Issue OCPBUGS-18054 to depend on a bug targeting a version in 4.13.0, 4.13.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@martinkennelly
Contributor Author

/jira refresh

@openshift-ci-robot changed labels Aug 25, 2023:
  • added jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • removed jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@martinkennelly: This pull request references Jira Issue OCPBUGS-18054, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.z) matches configured target version for branch (4.12.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-17910 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-17910 targets the "4.13.z" version, which is one of the valid target versions: 4.13.0, 4.13.z
  • bug has dependents

Requesting review from QA contact:
/cc @huiran0826


@dcbw
Contributor

dcbw commented Aug 29, 2023

/retest-required
/approve
/lgtm

@openshift-ci bot added labels Aug 29, 2023:
  • lgtm: Indicates that a PR is ready to be merged.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
@martinkennelly
Contributor Author

/assign @ricky-rav

@ricky-rav
Contributor

/lgtm
Thanks, Martin!

@openshift-ci
Contributor

openshift-ci bot commented Aug 30, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, martinkennelly, ricky-rav

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcaamano
Contributor

/label backport-risk-assessed

@openshift-ci bot added labels Aug 31, 2023:
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ac22a20 and 2 for PR HEAD f0f710b in total

@martinkennelly
Contributor Author

/test e2e-aws-ovn-hypershift

Failed because it was unable to find a container image - retesting as it looks transient.

@martinkennelly
Contributor Author

Looks like the mode migration jobs have been permafailing for a few weeks. I will ask the team if they know about this.

@martinkennelly
Contributor Author

The mode migration jobs are failing, and the failures are unrelated to this PR. However, I should look into why before asking for an override. This will take some time as I am currently busy.

@martinkennelly
Contributor Author

@jluhrsen I hear you're fixing the mode migration jobs in CNO in 4.12.
I tried to look for a link but only saw a 4.14 PR.
Can you link it here when you get a chance, so this PR can proceed once your fixes are in?
I'd rather not override the jobs if I can expect a fix soon.
A timeline would be nice but is not required :) Thanks!

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 2a22d71 and 1 for PR HEAD f0f710b in total

@martinkennelly
Contributor Author

I am going to ask for an override. It is pointless to wait any longer and waste money on failing jobs.

@dcbw
Contributor

dcbw commented Sep 13, 2023

/override ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration
/override ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration

@openshift-ci
Contributor

openshift-ci bot commented Sep 13, 2023

@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration, ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration


@martinkennelly
Contributor Author

Lots of image pull issues:

received unexpected HTTP status: 504 Gateway Time-out

/retest

@martinkennelly
Contributor Author

pull-ci-openshift-ovn-kubernetes-release-4.12-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade

Failed because the disruption budget limit is 1s and the CI run saw 2s.

4.12-upgrade-from-stable-4.11-local-gateway-e2e-aws-ovn-upgrade

When the kapi client attempted to pull a CRD to check openshift.io-specific CRDs, it got back a bad reply from the API server (?):
rpc error: code = Unknown desc = malformed header: missing HTTP content-type
This led the test to fail instantly, as there is no retry mechanism there.
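
For illustration only (this is not part of the PR or of the test in question), a bounded retry around such a call could look like the sketch below. `fetchWithRetry` and `errMalformedHeader` are hypothetical names, and the fetch function is a stub rather than a real API client call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errMalformedHeader is a hypothetical stand-in for the transient error above.
var errMalformedHeader = errors.New("malformed header: missing HTTP content-type")

// fetchWithRetry calls fetch up to attempts times, sleeping delay between
// tries, and returns nil on the first success.
func fetchWithRetry(attempts int, delay time.Duration, fetch func() error) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fetch(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", i, attempts, err)
		if i < attempts {
			time.Sleep(delay)
		}
	}
	// Wrap the last error so callers can still inspect it with errors.Is.
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Stub fetch: fails once with the transient error, then succeeds.
	err := fetchWithRetry(3, 10*time.Millisecond, func() error {
		calls++
		if calls == 1 {
			return errMalformedHeader
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A single bad reply would then cost one extra attempt instead of failing the whole test, at the price of masking errors that are not actually transient.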

@martinkennelly
Contributor Author

Both errors unrelated to this PR.

@martinkennelly
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD c40f9a2 and 0 for PR HEAD f0f710b in total

@openshift-ci-robot
Contributor

/hold

Revision f0f710b was retested 3 times: holding

@openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Sep 25, 2023
@martinkennelly
Contributor Author

/test e2e-aws-ovn-upgrade-local-gateway

@martinkennelly
Contributor Author

/unhold

I don't know why the overrides are now gone. @dcbw, can you reapply them?

@openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Sep 26, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD c40f9a2 and 2 for PR HEAD f0f710b in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4af0a99 and 1 for PR HEAD f0f710b in total

@martinkennelly
Contributor Author

GW mode migration will be fixed by https://issues.redhat.com/browse/OCPBUGS-17391.
Waiting on a backport for that to 4.12.

@jcaamano
Contributor

/override ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration
/override ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration

@openshift-ci
Contributor

openshift-ci bot commented Sep 27, 2023

@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration, ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration


@martinkennelly
Contributor Author

/retest

BM CI was bad this morning.

@openshift-ci
Contributor

openshift-ci bot commented Sep 27, 2023

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration | f0f710b | link | true | /test e2e-aws-ovn-shared-to-local-gateway-mode-migration |
| ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration | f0f710b | link | true | /test e2e-aws-ovn-local-to-shared-gateway-mode-migration |

Full PR test history. Your PR dashboard.


@openshift-merge-robot merged commit 87440f4 into openshift:release-4.12 Sep 27, 2023
26 checks passed
@openshift-ci-robot
Contributor

@martinkennelly: Jira Issue OCPBUGS-18054: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18054 has been moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.12.0-0.nightly-2023-09-28-010903

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.