OCPBUGS-20024: Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events" #28233

Merged

candita
Contributor

@candita candita commented Aug 31, 2023

Revert the status of TopologyAwareHintsDisabled as a known issue, and fix the
unit test that relied on TopologyAwareHintsDisabled being a known issue.

Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events"
This reverts commit 34fce4f.

Add a reference to a known bug that can be considered pathological and interesting.
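To make the effect of the revert concrete, here is a minimal Go sketch of an allow-list check for repeated events. It is not the code in openshift/origin: `knownIssue`, `knownIssues`, and `classify` are hypothetical names, and the pattern is an assumption modelled on the `reason/TopologyAwareHintsDisabled` prefix that appears in the CI logs quoted in this thread. The bug key is taken from the title of the reverted commit.

```go
package main

import (
	"fmt"
	"regexp"
)

// knownIssue pairs a hypothetical event pattern with the bug that tracks it.
type knownIssue struct {
	pattern *regexp.Regexp
	bug     string
}

// knownIssues is illustrative only. The single entry stands in for the kind
// of exception that the reverted commit added.
var knownIssues = []knownIssue{
	{regexp.MustCompile(`reason/TopologyAwareHintsDisabled`), "OCPBUGS-13366"},
}

// classify reports how a repeated event message would be treated:
// tolerated if it matches an allow-list entry, otherwise pathological.
func classify(message string, issues []knownIssue) string {
	for _, ki := range issues {
		if ki.pattern.MatchString(message) {
			return "known issue: " + ki.bug
		}
	}
	return "pathological"
}

func main() {
	msg := "reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone"
	fmt.Println(classify(msg, knownIssues)) // exception in place
	fmt.Println(classify(msg, nil))         // exception removed
}
```

With the entry present the message is tolerated; with the entry removed, which is the situation this revert restores, the same message is classified as pathological.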

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 31, 2023
@openshift-ci-robot

@candita: This pull request references Jira Issue OCPBUGS-13366, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This reverts commit 34fce4f.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@candita candita changed the title Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events" OCPBUGS-5943: Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events" Aug 31, 2023
@openshift-ci openshift-ci bot requested review from bparees and csrwng August 31, 2023 16:48
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 31, 2023
@openshift-ci-robot

@candita: This pull request references Jira Issue OCPBUGS-5943, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

The bug has been updated to refer to the pull request using the external bug tracker.


@candita
Contributor Author

candita commented Aug 31, 2023

Requested by @soltysh in OCPBUGS-5943.

@candita
Contributor Author

candita commented Aug 31, 2023

/jira refresh

@openshift-ci-robot

@candita: This pull request references Jira Issue OCPBUGS-5943, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh


@candita candita force-pushed the OCPBUGS-5943-revertKnownProblem branch from 356ee60 to b0791f0 Compare September 5, 2023 21:35
@openshift-ci-robot

@candita: This pull request references Jira Issue OCPBUGS-5943, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@candita
Contributor Author

candita commented Sep 5, 2023

This unit test failure doesn't appear to be related to my changes.

--- FAIL: TestBackendSampler_checkConnection (3.14s)
--- FAIL: TestBackendSampler_checkConnection/302-no-expect (1.00s)
disruption_backend_sampler_test.go:150: CheckConnection() error = Get "http://www.google.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers), wantErr false
Sep 5 21:48:20.763: INFO: reason/DisruptionBegan request-audit-id/ backend-disruption-name/backend connection/new disruption/openshift-tests stopped responding to GET requests over new connections: now fail
...
Sep 5 21:48:26.765: INFO: reason/DisruptionSamplerOutageBegan request-audit-id/ DNS lookup timeouts began for backend-disruption-name/backend connection/new disruption/openshift-tests GET requests over new connections: dial tcp: lookup static.redhat.com: i/o timeout (likely a problem in cluster running tests, not the cluster under test)
FAIL
github.com/openshift/origin/pkg/monitor/backenddisruption coverage: 48.0% of statements
FAIL github.com/openshift/origin/pkg/monitor/backenddisruption 11.719s
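The log above shows the test case timing out on a request to an external host, which fits the reading that the failure is environmental. As an illustration only, the following Go sketch shows one common way to exercise a 302 response without leaving the machine, using a local `httptest` server. `checkConnection`, `newRedirectServer`, and `defaultTimeout` are hypothetical names, not the code in `disruption_backend_sampler_test.go`, and treating an unfollowed redirect as success is an assumption about what the test intends.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// defaultTimeout is an arbitrary value chosen for this sketch.
const defaultTimeout = time.Second

// checkConnection is a hypothetical stand-in for a connectivity check.
// It issues one GET and returns the status code of the first response.
func checkConnection(url string, timeout time.Duration) (int, error) {
	client := &http.Client{
		Timeout: timeout,
		// Return the redirect response itself instead of following it,
		// so a 302 never triggers a second request to another host.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// newRedirectServer starts a local server that answers every request with 302.
func newRedirectServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "http://example.invalid/", http.StatusFound)
	}))
}

func main() {
	srv := newRedirectServer()
	defer srv.Close()

	code, err := checkConnection(srv.URL, defaultTimeout)
	fmt.Println(code, err)
}
```

Because the server listens on a loopback address and the redirect target is never contacted, a check written this way does not depend on DNS or outbound connectivity in the CI environment.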

/test unit

@soltysh
Member

soltysh commented Sep 6, 2023

/retest-required

@candita
Contributor Author

candita commented Sep 6, 2023

GCP OVN Upgrade did show signs of issues with TopologyAwareHints, but they were preceded by this:

(12 times) from: 2023-09-06 11:51:06 +0000 UTC to 2023-09-06 11:51:07 +0000 UTC
time="2023-09-06T11:51:23Z" level=error msg="disruption sample failed: error running request: 429 Too Many Requests: The apiserver is shutting down, please try again later.\n" auditID=b3d467f8-faed-4ef0-8d77-8b82d67d244f backend=openshift-api-reused-connections this-instance="{Disruption map[backend-disruption-name:openshift-api-reused-connections connection:reused disruption:openshift-tests]}" type=reused
Sep 6 11:51:24.606: INFO: reason/DisruptionBegan request-audit-id/b3d467f8-faed-4ef0-8d77-8b82d67d244f backend-disruption-name/openshift-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: error running request: 429 Too Many Requests: The apiserver is shutting down, please try again later.

/test e2e-gcp-ovn-upgrade

@candita
Contributor Author

candita commented Sep 6, 2023

Installation issue:

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=warning msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs
level=info msg= root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get nodes -o json
level=info msg= root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get machines --all-namespaces -o json
level=info msg= root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get csr -o json
level=info msg=Bootstrap gather logs captured here "/tmp/installer/log-bundle-20230906145842.tar.gz"
Installer exit with code 5

/test e2e-gcp-ovn-upgrade

@candita
Contributor Author

candita commented Sep 12, 2023

This run reported quite a few TopologyAwareHintsDisabled messages, but they date from 9/6. I will try again, since Sippy results are now almost all green, are no longer pathological, and no longer report TopologyAwareHintsDisabled.

InvolvedObject:{Kind:Service Namespace:openshift-dns Name:dns-default UID:33570a0c-58ae-4896-aec6-e32af30397f2 APIVersion:v1 ResourceVersion:23371 FieldPath:} Reason:TopologyAwareHintsDisabled Message:Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 Source:{Component:endpoint-slice-controller Host:} FirstTimestamp:2023-09-06 22:59:42 +0000 UTC LastTimestamp:2023-09-06 22:59:48 +0000 UTC Count:2 Type:Warning EventTime:0001-01-01 00:00:00 +0000 UTC Series:nil Action: Related:nil ReportingController: ReportingInstance:}
resulting new interval: reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 (2 times) from: 2023-09-06 22:59:48 +0000 UTC to 2023-09-06 22:59:49 +0000 UTC
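The interval line above carries its repeat count as a trailing `(2 times)`. The following Go sketch shows how such a count could be extracted and compared against a limit. `repeatCount` and `exceeds` are hypothetical helpers, not functions from openshift/origin, and the threshold of 20 passed in `main` is an assumed value used only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// timesRe captures the repeat count from text such as "(2 times)".
// It does not match "(5 endpoints, 3 zones)" because the digits must be
// followed directly by " times)".
var timesRe = regexp.MustCompile(`\((\d+) times\)`)

// repeatCount returns the count embedded in an interval message,
// or 0 if the message carries no count.
func repeatCount(message string) int {
	m := timesRe.FindStringSubmatch(message)
	if m == nil {
		return 0
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0
	}
	return n
}

// exceeds reports whether the event repeated more often than threshold.
func exceeds(message string, threshold int) bool {
	return repeatCount(message) > threshold
}

func main() {
	msg := "reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone " +
		"without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 (2 times)"
	// The threshold 20 is an assumption for this sketch.
	fmt.Println(repeatCount(msg), exceeds(msg, 20)) // prints "2 false"
}
```

Under that assumed limit, an event seen twice, as in this run, would not be flagged on its count alone.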

/test e2e-gcp-ovn-upgrade

Update: Disregard. Sippy results are not representative of this change.

@candita candita changed the title OCPBUGS-5943: Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events" [WIP] OCPBUGS-5943: Revert "OCPBUGS-13366: ignore repeated TopologyAwareHintsDisabled events" Sep 15, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2023
@candita
Contributor Author

candita commented Sep 15, 2023

/hold

Hold for outcome of upstream issue kubernetes/kubernetes#118823, with proposed fix kubernetes/kubernetes#119317.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 15, 2023
@candita
Contributor Author

candita commented Oct 4, 2023

/test e2e-gcp-ovn-upgrade

@candita
Contributor Author

candita commented Oct 6, 2023

/retest-required

Member

@soltysh soltysh left a comment

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2023
@soltysh
Member

soltysh commented Oct 9, 2023

/hold
I've seen it happen in periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial, let's try to run it:

/periodic job periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2023
@melvinjoseph86

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Oct 9, 2023

@candita: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-csi | b0791f0 | link | false | /test e2e-gcp-csi |
| ci/prow/e2e-gcp-ovn-rt-upgrade | b0791f0 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-metal-ipi-sdn | b0791f0 | link | false | /test e2e-metal-ipi-sdn |
| ci/prow/e2e-aws-ovn-single-node | b0791f0 | link | false | /test e2e-aws-ovn-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@candita
Contributor Author

candita commented Oct 10, 2023

/periodic job periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial

@candita
Contributor Author

candita commented Oct 10, 2023

error: failed to push image registry.build04.ci.openshift.org/ci-op-jm0m9d6h/release:initial: unable to upload new layer (0): Patch "https://registry.build04.ci.openshift.org/v2/ci-op-jm0m9d6h/release/blobs/uploads/7aab1d99-c143-4a4e-be72-86158692b83c?_state=YOsJRiyZwYVZDs9u3WIuKA1vhBgkb0Hdks2oOjj5wZZ7Ik5hbWUiOiJjaS1vcC1qbTBtOWQ2aC9yZWxlYXNlIiwiVVVJRCI6IjdhYWIxZDk5LWMxNDMtNGE0ZS1iZTcyLTg2MTU4NjkyYjgzYyIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMy0xMC0wOVQxMjoxNDozMi43NTgwNDk0MzNaIn0%3D": operator "cluster-network-operator" contained an invalid image-references file: no input image tag named "kuryr-cni"

/test e2e-gcp-ovn-upgrade

@candita
Contributor Author

candita commented Oct 10, 2023

/test images

@soltysh
Member

soltysh commented Oct 12, 2023

/periodic-job periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial

@soltysh
Member

soltysh commented Oct 12, 2023

I again checked our logs, and I only see this problem happening in 4.13 and 4.14, so I guess we can just go ahead with this as is.
/hold cancel

@soltysh
Member

soltysh commented Oct 12, 2023

/retest-required

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 3a53070 and 2 for PR HEAD b0791f0 in total

@openshift-ci openshift-ci bot merged commit 0098195 into openshift:master Oct 13, 2023
18 of 22 checks passed
@openshift-ci-robot

@candita: Jira Issue OCPBUGS-20024: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-20024 has not been moved to the MODIFIED state.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-11-01-040931

stbenjam added a commit to stbenjam/origin that referenced this pull request Nov 6, 2023
…revertKnownProblem"

This reverts commit 0098195, reversing
changes made to 72fa973.
openshift-merge-bot bot added a commit that referenced this pull request Nov 6, 2023
TRT-1339: Revert #28233 "ignore repeated TopologyAwareHintsDisabled events"
xueqzhan added a commit to xueqzhan/origin that referenced this pull request Nov 7, 2023
openshift-merge-bot bot added a commit that referenced this pull request Nov 10, 2023
…9276048783

Revert "TRT-1339: Revert #28233 "ignore repeated TopologyAwareHintsDisabled events""
@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2024-01-05-151121

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
5 participants