sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

oshoval · 2021-02-10T12:17:54Z

In case running multi sriov jobs on CI,
transient warnings such as the following, occur [1]
unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
It happens only when there are 2+ jobs at the same time, and if each of the job runs
2 VMs at the same time.
In this case it happens to around 30% of the jobs.

Kubevirt itself has a re-enqueuing mechanism for this kind of warnings,
so their effect is transient, and the test would pass once continued.

Add a mechanism to allow switching selected warnings
to be non fatal.
Use this mechanism in order to update this warning severity
to log only instead of failing the test.

There are some open issues that point
its related to resource extensive usage [2] (tried the suggested method):
once there are lots of vms spinning, each vm needs its own cpu quota,
1 cpu goes to kube-reserve (0.5) and system-reserve (0.5) according kind config
which leaves us with 1 cpu total (i tried allocating 2 cpus total for the cluster),
and it seems its not enough.

See also [3] about open issues with system reserve,
and why its better to not try and solve it at this very moment, for kind.

Once we bump k8s to 1.19 in kind it might help, unless it affects only windows [4].

The timeout happens between the virt-handler and virt-launcher gRPC.
See [5] for more info.

[1]
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563

[2] etcd-io/etcd#12234 (comment)
[3] kubernetes/kubernetes#72881
[4] kubernetes/kubernetes#95735 (comment)

[5] #5027

Signed-off-by: Or Shoval oshoval@redhat.com

None

oshoval · 2021-02-10T12:19:20Z

/cc @EdDev @qinqon

EdDev · 2021-02-10T13:18:24Z

Ignoring warnings may mask real issues and we do not want that.
The reconciliation mechanism is there to recover from problems, but in tests we are suppose to control those or know what to expect explicitly.

If the specific warning that is detected is an acceptable one, then only it should be accepted and nothing more.
A detail explanation of what it is and why we can accept it under a specific condition is to be given (it has been given here partially, mainly through links).

Please also include the failure percentage due to this issue.

oshoval · 2021-02-10T13:39:27Z

Ignoring warnings may mask real issues and we do not want that.
The reconciliation mechanism is there to recover from problems, but in tests we are suppose to control those or know what to expect explicitly.

If the specific warning that is detected is an acceptable one, then only it should be accepted and nothing more.
A detail explanation of what it is and why we can accept it under a specific condition is to be given (it has been given here partially, mainly through links).

Please also include the failure percentage due to this issue.

I think in this case we can create that mechanism in a following PR, and focus atm on the stuff that is more important (at least in my view).
Running several clusters with several VMs at the same time is resource extensive, so this specific warning occurs (will elaborate more after deeper research upon need).
About masking other warnings, the fact that there is retry mechanism in kubevirt and that e2e already have an option to ignore them, and that those warnings are transient, means imho, that we shouldn't be blocked in this very moment for presenting a new mechanism that filter specific warnings.
We will do it, but after the current effort is done if possible please.
@phoracek @EdDev @qinqon
Wdyt please, can we postpone the mechanism for a following PR, and focus on the critical path (and the right one imho) ?

Its just a warning, it should not fail the test, a better approach maybe is just to show the events in case the tests fails (unless it already shows the events) imo.

Other tests ignore warnings as well, and sriov meant to test SRIOV, not the VM spinning.
Moreover we are doing some relaxations such as etcd in ramFs, because its kind.
I think this change is also legit, at least at phase 1.
Doing phase 2 in the right time will allow us to come more prepared for it usually.

EDIT:
Moreover warnings should not fail the tests, its not an error.
Something that worth investing imo, is to add a mechanism to print the events, unless there is one already
and then in case a warning occurred and the test failed eventually, the warning will be visible there.
Update:
We do have the warning already in artifacts events file

           "reason": "SyncFailed",
            "message": "unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
            "source": {
                "component": "virt-handler",
                "host": "sriov-worker"
            },

The mechanism to ignoreWarnings existed before this PR, and used in several e2e test files.
WaitForSuccessfulVMIStartIgnoreWarnings
https://github.com/doc22940/kubevirt/search?q=WaitForSuccessfulVMIStartIgnoreWarnings
The mechanism:
https://github.com/doc22940/kubevirt/blob/ad8b5db41598834539be5ab83c12c0df099b4714/tests/utils.go#L2367

This PR suggest to ignore the warning, but it will still log it, and have it in the events (it checks for warning in parallel if requested else it just log the warning, and assert for errors, only thing that is worthy imo is to double check that its indeed in parallel).
https://github.com/doc22940/kubevirt/blob/ad8b5db41598834539be5ab83c12c0df099b4714/tests/utils.go#L335

oshoval · 2021-02-11T15:22:52Z

Addressed comments

oshoval · 2021-02-12T07:27:37Z

/hold
We can still review please, but i will release the hold once its checked with new code of multi sriov on CI.
(i did check locally but with injection)

oshoval · 2021-02-14T12:00:41Z

/hold cancel
tested on CI

qinqon

first pass, is good

tests/utils.go

oshoval · 2021-02-21T08:39:33Z

Addressed the resolved comments, (with little additional refactor)
Thanks

oshoval · 2021-02-24T07:33:46Z

/retest

oshoval · 2021-02-24T08:51:50Z

/retest

qinqon

All good now

oshoval · 2021-02-24T11:35:08Z

Thanks @qinqon @EdDev

oshoval · 2021-02-24T13:16:23Z

/retest

oshoval · 2021-02-24T15:34:39Z

/retest

oshoval · 2021-03-03T11:23:24Z

/assign @enp0s3

In case running multi sriov jobs on CI, transient errors such as [1] "unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded" Kubevirt itself has a re-enqueuing mechanism for this kind of errors. Kubevirt e2e tests already have a mechanism to ignore warnings as well, used on some of the e2e tests. We are using Kind with a non official mode, DinD (and even with ramFS for the etcd). As such, according the community there isn't a best practice for this kind of problem, and there are some open issues that point to resource extensive usage. Add a mechanism to allow switching selected warnings to be non fatal. Use this mechanism in order to update this warning severity to log only instead of failing the test. [1] https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/822/rehearsal-pull-kubevirt-e2e-kind-1.17-sriov/1348156475482050563 See kubevirt#5027 Signed-off-by: Or Shoval <oshoval@redhat.com>

enp0s3

Thank you! looks good, please see my comments below.
A general concern is that we don't have unit tests for our test tooling and we are adding more
functionality, but let's leave this task for a separate PR.

tests/utils.go

oshoval · 2021-03-03T13:49:16Z

Rebased

enp0s3

Thank you
/approve

kubevirt-bot · 2021-03-03T14:23:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~tests/OWNERS~~ [enp0s3]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-commenter-bot · 2021-03-04T08:55:07Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-commenter-bot · 2021-03-04T23:24:05Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-commenter-bot · 2021-03-05T22:24:06Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-commenter-bot · 2021-03-06T09:46:09Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-bot requested review from ILpinto and kbidarkar February 10, 2021 12:18

kubevirt-bot requested review from EdDev and qinqon February 10, 2021 12:19

oshoval force-pushed the deadline branch from 8979b24 to 1deafa4 Compare February 11, 2021 15:08

kubevirt-bot added size/M and removed size/S labels Feb 11, 2021

oshoval changed the title ~~sriov, Ignore VM transient warnings~~ sriov, Ignore SyncVMI DeadlineExceeded warning Feb 11, 2021

kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 12, 2021

kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2021

oshoval force-pushed the deadline branch from 1deafa4 to 4fe59f3 Compare February 14, 2021 15:30

oshoval changed the title ~~sriov, Ignore SyncVMI DeadlineExceeded warning~~ sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests Feb 15, 2021

oshoval mentioned this pull request Feb 17, 2021

Storage Starting a VirtualMachineInstance [rfe_id:3106][crit:medium][vendor:cnv-qe@redhat.com][level:component]with Alpine PVC should be successfully started with NFS Disk PVC using ipv4 address of the NFS pod not owned by qemu #4962

Closed

qinqon suggested changes Feb 18, 2021

View reviewed changes

tests/utils.go Show resolved Hide resolved

tests/utils.go Outdated Show resolved Hide resolved

tests/utils.go Outdated Show resolved Hide resolved

kubevirt-bot assigned qinqon Feb 18, 2021

oshoval force-pushed the deadline branch from 4fe59f3 to 09896a6 Compare February 21, 2021 08:38

oshoval requested a review from qinqon February 21, 2021 09:42

oshoval requested a review from qinqon February 23, 2021 10:59

qinqon approved these changes Feb 24, 2021

View reviewed changes

kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2021

kubevirt-bot assigned enp0s3 Mar 3, 2021

kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021

enp0s3 suggested changes Mar 3, 2021

View reviewed changes

tests/utils.go Show resolved Hide resolved

tests/utils.go Show resolved Hide resolved

tests/utils.go Show resolved Hide resolved

kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2021

oshoval force-pushed the deadline branch from 8fb3bef to 39e7402 Compare March 3, 2021 13:48

kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021

oshoval requested review from qinqon and enp0s3 March 3, 2021 14:09

enp0s3 approved these changes Mar 3, 2021

View reviewed changes

kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2021

kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021

kubevirt-bot merged commit 112dd92 into kubevirt:master Mar 7, 2021

oshoval mentioned this pull request Mar 25, 2021

SyncVMI DeadlineExceeded warnings when running parallel sriov jobs on CI #5027

Closed

oshoval mentioned this pull request Mar 23, 2023

e2e, SR-IOV: Ignore SyncVMI timeout warning #9490

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

oshoval commented Feb 10, 2021 •

edited

oshoval commented Feb 10, 2021

EdDev commented Feb 10, 2021

oshoval commented Feb 10, 2021 •

edited

oshoval commented Feb 11, 2021

oshoval commented Feb 12, 2021 •

edited

oshoval commented Feb 14, 2021

qinqon left a comment

oshoval commented Feb 21, 2021

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

qinqon left a comment

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

oshoval commented Mar 3, 2021

enp0s3 left a comment •

edited

oshoval commented Mar 3, 2021

enp0s3 left a comment

kubevirt-bot commented Mar 3, 2021

kubevirt-commenter-bot commented Mar 4, 2021

kubevirt-commenter-bot commented Mar 4, 2021

kubevirt-commenter-bot commented Mar 5, 2021

kubevirt-commenter-bot commented Mar 6, 2021

sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

sriov, Let SyncVMI DeadlineExceeded warning be non fatal in tests #5002

Conversation

oshoval commented Feb 10, 2021 • edited

oshoval commented Feb 10, 2021

EdDev commented Feb 10, 2021

oshoval commented Feb 10, 2021 • edited

oshoval commented Feb 11, 2021

oshoval commented Feb 12, 2021 • edited

oshoval commented Feb 14, 2021

qinqon left a comment

Choose a reason for hiding this comment

oshoval commented Feb 21, 2021

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

qinqon left a comment

Choose a reason for hiding this comment

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

oshoval commented Feb 24, 2021

oshoval commented Mar 3, 2021

enp0s3 left a comment • edited

Choose a reason for hiding this comment

oshoval commented Mar 3, 2021

enp0s3 left a comment

Choose a reason for hiding this comment

kubevirt-bot commented Mar 3, 2021

kubevirt-commenter-bot commented Mar 4, 2021

kubevirt-commenter-bot commented Mar 4, 2021

kubevirt-commenter-bot commented Mar 5, 2021

kubevirt-commenter-bot commented Mar 6, 2021

oshoval commented Feb 10, 2021 •

edited

oshoval commented Feb 10, 2021 •

edited

oshoval commented Feb 12, 2021 •

edited

enp0s3 left a comment •

edited