sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002
Ignoring warnings may mask real issues and we do not want that. If the specific warning that is detected is an acceptable one, then only it should be accepted and nothing more. Please also include the failure percentage due to this issue. |
I think in this case we can create that mechanism in a follow-up PR and focus for now on what is more important (at least in my view). It's just a warning, so it should not fail the test; a better approach may be to just show the events when the test fails (unless it already shows them), imo. Other tests ignore warnings as well, and the sriov tests are meant to test SRIOV, not VM spin-up. EDIT:
The ignoreWarnings mechanism existed before this PR and is used in several e2e test files. This PR suggests ignoring the warning, but it will still be logged and will still appear in the events (the watcher checks for warnings in parallel if requested; otherwise it just logs the warning and asserts on errors. The only thing worth double-checking, imo, is that it indeed runs in parallel). |
Addressed comments |
/hold |
/hold cancel |
first pass, is good
Addressed the resolved comments (with a little additional refactoring) |
/retest |
/retest |
All good now
/retest |
/retest |
/assign @enp0s3 |
When running multiple sriov jobs on CI, transient errors occur, such as [1]:
"unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Kubevirt itself has a re-enqueuing mechanism for this kind of error.
Kubevirt e2e tests already have a mechanism to ignore warnings as well, used in some of the e2e tests.

We are using Kind in a non-official mode, DinD (and even with ramFS for etcd).
As such, according to the community there isn't a best practice for this kind of problem, and there are some open issues that point to extensive resource usage.

Add a mechanism to allow switching selected warnings to be non-fatal.
Use this mechanism to downgrade this warning's severity to log-only instead of failing the test.

[1] https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563

See kubevirt#5027

Signed-off-by: Or Shoval <oshoval@redhat.com>
Thank you! looks good, please see my comments below.
A general concern is that we don't have unit tests for our test tooling and we are adding more
functionality, but let's leave this task for a separate PR.
Rebased |
Thank you
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enp0s3 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/retest |
/retest |
/retest |
When running multiple sriov jobs on CI,
transient warnings such as the following occur [1]:
unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
It happens only when 2+ jobs run at the same time, and each of the jobs runs
2 VMs at the same time.
In this case it affects around 30% of the jobs.
Kubevirt itself has a re-enqueuing mechanism for this kind of warning,
so the effect is transient, and the test would pass if it were allowed to continue.
Add a mechanism to allow switching selected warnings
to be non-fatal.
Use this mechanism to downgrade this warning's severity
to log-only instead of failing the test.
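A minimal sketch of what such a mechanism could look like, assuming warnings are matched by message substring. The names `Event`, `WarningsPolicy`, `IsFatal` and `Partition` are hypothetical and are not the actual KubeVirt test helpers; the second warning message in `main` is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, trimmed-down stand-in for a Kubernetes event.
type Event struct {
	Type    string // "Normal" or "Warning"
	Message string
}

// WarningsPolicy lists message substrings of warnings that are only logged
// instead of failing the test (hypothetical type, for illustration only).
type WarningsPolicy struct {
	NonFatalSubstrings []string
}

// IsFatal reports whether the event should fail the test: it must be a
// warning whose message matches none of the non-fatal substrings.
func (p WarningsPolicy) IsFatal(e Event) bool {
	if e.Type != "Warning" {
		return false
	}
	for _, s := range p.NonFatalSubstrings {
		if strings.Contains(e.Message, s) {
			return false
		}
	}
	return true
}

// Partition splits events into warnings that fail the test and warnings
// that are only logged. Non-warning events are dropped.
func (p WarningsPolicy) Partition(events []Event) (fatal, logged []Event) {
	for _, e := range events {
		switch {
		case p.IsFatal(e):
			fatal = append(fatal, e)
		case e.Type == "Warning":
			logged = append(logged, e)
		}
	}
	return fatal, logged
}

func main() {
	policy := WarningsPolicy{NonFatalSubstrings: []string{
		"SyncVMI: rpc error: code = DeadlineExceeded",
	}}
	events := []Event{
		{Type: "Warning", Message: "unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded"},
		{Type: "Normal", Message: "Created virtual machine pod"},
		{Type: "Warning", Message: "some other warning"}, // made-up message
	}
	fatal, logged := policy.Partition(events)
	for _, e := range logged {
		fmt.Println("ignored warning:", e.Message)
	}
	fmt.Println("fatal warnings:", len(fatal))
}
```

Only the explicitly listed warning is downgraded; any other warning still fails the test, which keeps unrelated issues from being masked.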
There are some open issues suggesting
it is related to extensive resource usage [2] (I tried the suggested method):
once lots of VMs are spinning up, each VM needs its own CPU quota.
1 CPU goes to kube-reserved (0.5) and system-reserved (0.5) according to the kind config,
which leaves us with 1 CPU in total (I tried allocating 2 CPUs in total for the cluster),
and it seems that is not enough.
See also [3] about open issues with system-reserved,
and why it is better not to try to solve it at this very moment, for kind.
Once we bump k8s to 1.19 in kind it might help, unless it affects only Windows [4].
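The CPU budget described above can be worked through as a small sketch. The reservation values come from the description; the function names and the per-VM CPU demand (500m) are assumptions made only for illustration, not measured values.

```go
package main

import "fmt"

// allocatableMilliCPU returns the CPU left for pods, in millicores, after
// subtracting the kube-reserved and system-reserved amounts from the total.
func allocatableMilliCPU(total, kubeReserved, systemReserved int) int {
	return total - kubeReserved - systemReserved
}

// fits reports whether n workloads needing perWorkload millicores each
// fit into the allocatable budget.
func fits(allocatable, n, perWorkload int) bool {
	return n*perWorkload <= allocatable
}

func main() {
	// 2 CPUs for the cluster, 0.5 kube-reserved, 0.5 system-reserved.
	alloc := allocatableMilliCPU(2000, 500, 500)
	fmt.Println("allocatable millicores:", alloc) // 1000

	// Assumption: 2 concurrent jobs x 2 VMs, each VM needing 500m.
	fmt.Println("4 VMs fit:", fits(alloc, 4, 500)) // false
}
```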
The timeout happens in the gRPC call between virt-handler and virt-launcher.
See [5] for more info.
[1]
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563
[2] etcd-io/etcd#12234 (comment)
[3] kubernetes/kubernetes#72881
[4] kubernetes/kubernetes#95735 (comment)
[5] #5027
Signed-off-by: Or Shoval <oshoval@redhat.com>