
NO-JIRA: Tolerate restarts for kubevirt external infra #3451


davidvossel
Contributor

This PR allows the test to tolerate a single restart, but only for KubeVirt running on external infra. The centralized KubeVirt infra test still does not tolerate any unexpected restarts.


The KubeVirt platform has two modes: centralized infra (where the HCP and VMs run on the same OCP cluster) and external infra (where the HCP and VMs run on separate OCP clusters).

When we test external infra, we run HCP KubeVirt nested within HCP KubeVirt. This is a complex environment, and it is difficult to guarantee predictable performance in it. We occasionally see random pods in the HCP namespace restart in this nested environment because the kubelet reports "Error: context deadline exceeded". This is likely a result of etcd latency within this environment.
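The restart-tolerance policy described above can be sketched as a small check. This is a minimal Python sketch, not the PR's actual Go implementation: the names `InfraMode`, `ContainerStatus`, `allowed_restarts`, and `unexpected_restarts` are hypothetical, and the sketch assumes the single-restart budget applies per container.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class InfraMode(Enum):
    # Hypothetical enum for the two KubeVirt modes described above.
    CENTRALIZED = "centralized"  # HCP and VMs on the same OCP cluster
    EXTERNAL = "external"        # HCP and VMs on separate OCP clusters


@dataclass
class ContainerStatus:
    # Hypothetical, trimmed-down view of a container status in the HCP namespace.
    pod: str
    container: str
    restart_count: int


def allowed_restarts(platform: str, infra_mode: InfraMode) -> int:
    """Return the per-container restart budget (assumed, not taken from the PR)."""
    # Only KubeVirt on external infra tolerates a single restart;
    # every other combination, including centralized KubeVirt, tolerates none.
    if platform == "KubeVirt" and infra_mode is InfraMode.EXTERNAL:
        return 1
    return 0


def unexpected_restarts(
    statuses: List[ContainerStatus], platform: str, infra_mode: InfraMode
) -> List[ContainerStatus]:
    """Return the containers whose restart count exceeds the budget."""
    budget = allowed_restarts(platform, infra_mode)
    return [s for s in statuses if s.restart_count > budget]
```

In a test, a non-empty result from `unexpected_restarts` would be treated as a failure, so a container that restarted once fails the centralized run but passes the external-infra run.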

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 22, 2024
@openshift-ci-robot

@davidvossel: This pull request explicitly references no jira issue.



@openshift-ci openshift-ci bot added area/testing Indicates the PR includes changes for e2e testing approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Jan 22, 2024
Contributor

@nunnatsa nunnatsa left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2024
Contributor

openshift-ci bot commented Jan 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, nunnatsa


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 9b08dcf and 2 for PR HEAD 221de09 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6c84753 and 1 for PR HEAD 221de09 in total

@nunnatsa
Contributor

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d7b8d75 and 0 for PR HEAD 221de09 in total

@openshift-ci-robot

/hold

Revision 221de09 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024
@davidvossel
Contributor Author

/retest-required

@nunnatsa
Contributor

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 8d96c1a and 2 for PR HEAD 221de09 in total

@nunnatsa
Contributor

/retest

@davidvossel
Contributor Author

/test verify

Signed-off-by: David Vossel <davidvossel@gmail.com>
@davidvossel davidvossel force-pushed the tolerate-external-infra-restart branch from 221de09 to e9904a3 Compare January 25, 2024 14:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2024
@qinqon
Contributor

qinqon commented Jan 25, 2024

/lgtm

@nunnatsa
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 83fb2da and 2 for PR HEAD e9904a3 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 330501b and 1 for PR HEAD e9904a3 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 05168af into openshift:main Jan 26, 2024
11 of 12 checks passed