OCPBUGS-26601: Re-enable test/extended/router/http2 tests on AWS #28515

Open
wants to merge 2 commits into
base: master

frobware
Contributor

@frobware frobware commented Jan 10, 2024

It's been a long time since we disabled these tests on AWS. I have been running the http2 tests on AWS all week and I haven't run into the issue once. Let's re-enable the http2 x AWS tests for better coverage.

This PR also addresses an intermittent issue encountered in AWS environments during the router's h2spec conformance tests. The challenge involved slower hostname resolution within the cluster, resulting in frequent timeouts. Notably, AWS exhibited slower resolution times compared to Azure or GCP, hinting at potential differences in DNS handling.

The fix resolves the hostname on the test host before starting the h2spec tests within the cluster. With this change the h2spec test completes in approximately 85 seconds, down from a previous run time of about 376 seconds (a little over 6 minutes).

While the difference in resolution times suggests environmental variations, particularly in AWS, it's important to note that this PR does not definitively attribute the issue to negative caching. Instead, it prioritises the substantial improvement achieved through the new approach. As a precaution, the polling interval and overall test timeout have been adjusted to 2 seconds and 10 minutes, respectively, to enhance test success rates across diverse cloud environments.

In short, this PR is a practical improvement in test run time; the underlying environmental differences remain open for further investigation if needed.

Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1912413

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 10, 2024
@openshift-ci-robot

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci bot requested review from knobunc and Miciah January 10, 2024 15:15
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2024
@frobware
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jan 10, 2024
@openshift-ci-robot

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 10, 2024
@openshift-ci openshift-ci bot requested a review from lihongan January 10, 2024 15:23
@candita
Contributor

candita commented Jan 10, 2024

See #26089

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD e913a64 and 2 for PR HEAD 09eb1f0 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 52f2f6b and 1 for PR HEAD 09eb1f0 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 663c840 and 0 for PR HEAD 09eb1f0 in total

@openshift-ci-robot

/hold

Revision 09eb1f0 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2024
@frobware
Contributor Author

/retest

It's been a long time since we disabled these tests on AWS. I have
been running the http2 tests on AWS all week and I haven't run into
the issue once. Let's re-enable the http2 x AWS tests for better
coverage.

Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1912413
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 11, 2024
@openshift-ci-robot

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@lihongan
Contributor

/test e2e-aws-ovn-upi

Contributor

openshift-ci bot commented Jan 12, 2024

@lihongan: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-ipv6
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-etcd-recovery
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-kubevirt
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-baremetalds-kubevirt
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-ovn-techpreview
  • /test e2e-gcp-ovn-techpreview-serial
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-sdn
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps


@candita
Contributor

candita commented Jan 12, 2024

This probably requires openshift/cloud-provider-aws#57

This commit tackles an intermittent issue found in AWS environments
during the router's h2spec conformance tests, specifically relating to
slower hostname resolution within the cluster that often results in
test timeouts. This slower resolution time in AWS, as compared to
Azure or GCP, suggests possible environmental differences in DNS
handling.

The solution involves resolving the hostname on the test host before
initiating the h2spec tests in the cluster. Implementing this change
leads to a significant improvement in the speed and consistency of
test executions. With this method, the h2spec test now completes in
~80 seconds, markedly faster than the previous 376 seconds (i.e., a
little over 6 minutes).

This observation, particularly when considering that using an
alternative DNS resolver like @1.1.1.1 on the node yields nearly
instant results for the same hostname, suggests distinctive DNS
resolution characteristics within AWS clusters. It doesn't
definitively attribute the issue to negative caching. To adapt to this
variability, I have adjusted the polling interval to 2 seconds and the
overall test timeout to 10 minutes. These changes aim to improve test
success rates across diverse cloud environments.

With these changes I consistently see the h2spec test on AWS
completing in ~85 seconds.

  Ran 1 of 1 Specs in 77.519 seconds
  Ran 1 of 1 Specs in 90.507 seconds
  Ran 1 of 1 Specs in 80.268 seconds

and without the change it appears to be very consistently 376 seconds.
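As a quick check of the figures quoted in this commit message, the following Go snippet averages the three run times and compares the result with the 376-second baseline. Only the four numbers come from the commit message; the helper `mean` is introduced here for illustration.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs (0 for an empty slice).
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Run times quoted in the commit message, in seconds.
	withFix := []float64{77.519, 90.507, 80.268}
	const withoutFix = 376.0

	avg := mean(withFix)
	fmt.Printf("mean with fix: %.1f s\n", avg)            // 82.8 s
	fmt.Printf("time saved:    %.1f s\n", withoutFix-avg) // 293.2 s
	fmt.Printf("speed-up:      %.1fx\n", withoutFix/avg)  // 4.5x
}
```

The mean of roughly 83 seconds sits between the "~80 seconds" and "~85 seconds" figures quoted in the message.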
@candita
Contributor

candita commented Feb 15, 2024

/retest-required

@candita
Contributor

candita commented Feb 15, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2024
Contributor

openshift-ci bot commented Feb 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, frobware


@candita
Contributor

candita commented Feb 16, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7812f3c and 2 for PR HEAD 00ea63b in total

@frobware
Contributor Author

/hold

I think the consensus was that this PR still requires openshift/cloud-provider-aws#57.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 16, 2024
@frobware
Contributor Author

frobware commented May 9, 2024

> /hold
>
> I think the consensus was that this PR still requires openshift/cloud-provider-aws#57.

openshift/cloud-provider-aws#57 has merged.

/test all

@frobware
Contributor Author

/retest

@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot

@openshift-bot: This pull request references Jira Issue OCPBUGS-26601, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 17, 2024
@openshift-ci-robot

@openshift-bot: This pull request references Jira Issue OCPBUGS-26601, which is invalid:

  • expected the bug to target either version "4.17." or "openshift-4.17.", but it targets "4.16.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@frobware
Contributor Author

/test all

Contributor

openshift-ci bot commented May 22, 2024

@frobware: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-serial | 00ea63b | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-ovn-single-node | 00ea63b | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-fips | 00ea63b | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-aws-ovn-upgrade | 00ea63b | link | false | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 00ea63b | link | false | /test e2e-aws-ovn-single-node-upgrade |


@openshift-trt-bot

Job Failure Risk Analysis for sha: 00ea63b

Job Name: pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
Failure Risk: High
Test: [sig-apps] job-upgrade
This test has passed 100.00% of 272 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

6 participants